HS650 TERM PAPER PROJECT
Winter 2018, DSPA (HS650)
Name: Omkar Sunkersett
I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.
TOPIC: BUILDING A PREDICTIVE MODEL FOR DETECTING DIABETES MELLITUS IN PATIENTS
ABSTRACT
## This term paper studies Diabetes Mellitus (Type II), a serious disease that is widespread in the world. Diabetes can lead to a variety of health complications, such as glaucoma of the eyes, which can cause permanent blindness. Diabetes is also hereditary in many cases, and its onset depends upon a variety of physical factors of the human body. This research project explores some of these physical factors to build a predictive model for detecting diabetes mellitus in patients, based upon real medical records obtained from a small population in the United States. The results should help researchers develop better predictive models for detecting diabetes mellitus in patients in the future.
INTRODUCTION
## Diabetes Mellitus is a serious disease in which the body's ability to produce or respond to the hormone insulin is impaired, resulting in abnormal metabolism of carbohydrates and elevated levels of glucose in the blood and urine [1]. It begins with insulin resistance, a condition in which the body's cells fail to respond properly to insulin. As the disease progresses, the patient's body may stop producing insulin altogether. Though hereditary factors may play a role in the onset of diabetes, the most common causes are excess body weight (typically measured by body mass index, BMI) and lack of proper exercise [2].
##
## It is estimated that over 400 million people worldwide have diabetes mellitus, with 90% of cases diagnosed as type II [3, 4, 5]. Diabetes mellitus is not gender specific: the rate of diabetes is nearly the same for men as for women [6]. The global economic cost of diabetes is above USD 600 billion, with the cost in the USA alone above USD 200 billion [7, 8]. This research project examines physical body parameters obtained from patients in the United States to build a predictive model for detecting the onset of diabetes in that population, using supervised and unsupervised machine learning techniques such as classification, regression, clustering and neural network analysis [9].
PROBLEM DEFINITION AND DATA
## This is an exploratory data analysis and machine learning problem on a dataset obtained from data.world [10]; the original dataset is available in the UCI Machine Learning Repository [11]. Exploratory analysis provides a preliminary examination of the available data: preliminary tests help discover relationships among the variables of the dataset, and these relationships help us decide which variables matter most. Each feature is a variable (attribute) of the dataset, and the attributes can be ranked by relative importance so that the most useful ones are selected as features of the predictive model. My approach uses both supervised and unsupervised learning techniques, such as regression or classification and k-means or neural networks, respectively.
##
## The diabetes dataset has been obtained from data.world. It consists of nine attributes: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, Body Mass Index (BMI), Diabetes Pedigree Function, Age and Outcome. These attributes are explained below in more detail [10, 11] --
## 1. Pregnancies: This indicates the number of times the female subject has been pregnant.
## 2. Glucose: This indicates the plasma glucose concentration at 2 hours in an oral glucose tolerance test.
## 3. Blood Pressure: This indicates the diastolic blood pressure of the female subject (in mm Hg).
## 4. Skin Thickness: This indicates the triceps skin fold thickness for the female subject (in mm).
## 5. Insulin: This indicates the female subject's 2-hour serum insulin level (mu U/ml).
## 6. Body Mass Index (BMI): This indicates the body mass index of the female subject (weight in kg/(height in m)^2).
## 7. Diabetes Pedigree Function: This is a measure of the genetic/hereditary risk for the onset of diabetes mellitus (pedi).
## 8. Age: This indicates the age of the female subject (in years).
## 9. Outcome: This indicates whether the female subject was diagnosed with diabetes mellitus. A value of 1 indicates that the subject tested positive, whereas a value of 0 indicates that the subject tested negative.
METHODOLOGY
## HYPOTHESIS / BASELINE: At the outset, we do not know which factors contribute to the onset of diabetes mellitus in patients, and hence cannot yet build predictive models for detecting diabetes in patients.
## This problem can be explored in a number of ways. For the scope of this project, I have split my problem-solving approach into the following steps, from loading the required packages to comparing the final results:
## 1. Install and load the required packages if they are not already installed in RStudio.
## 2. Load the dataset and study the dimensions, structure, summary and distribution of its variables.
## 3. Remove the missing values from the dataset, as these can negatively affect the analysis.
## 4. Study the skewness of the variables of the dataset using histogram analysis.
## 5. Study the boxplots and density curves of important variables such as Diabetes Pedigree Function and Plasma Glucose.
## 6. Examine the correlation for each pair of variables of the dataset and note the pairs that have high correlation.
## 7. (Optional) Plot the pairwise correlations graphically using additional techniques.
## 8. Perform exploratory analysis on the different age groups using techniques such as scatterplots, lined bar plots, stacked bar plots, box plots and classification pair plots.
## 9. Perform dimensionality reduction on the variables of the dataset using techniques such as t-distributed stochastic neighbor embedding (t-SNE) and principal component analysis (PCA).
## 10. Split the dataset into training and testing sets using a suitable training-to-testing ratio.
## 11. Train and test a set of predictive models using supervised and unsupervised machine learning techniques such as classification and clustering, respectively; selecting the right features for the models is important here.
## 12. Improve the models and report the results in order to determine the best machine learning technique.
The results include metrics such as model accuracy, sensitivity and specificity; these are discussed in a later section of this report [12].
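These three metrics all derive from the confusion matrix. As a quick reference, the sketch below computes them from a 2x2 matrix of hypothetical counts (all numbers are made up for illustration):

```r
# Hypothetical confusion-matrix counts (illustration only)
tp <- 50; fn <- 10   # actual positives: predicted positive / predicted negative
fp <- 5;  tn <- 35   # actual negatives: predicted positive / predicted negative

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # fraction of all predictions correct
sensitivity <- tp / (tp + fn)                   # true positive rate (recall)
specificity <- tn / (tn + fp)                   # true negative rate

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
##    accuracy sensitivity specificity
##       0.850       0.833       0.875
```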
RESULTS / OBSERVATIONS
## Below are the results and observations from my analysis --
Step 1: Loading the Packages for the Purpose of Analysis
## Loading all the packages required for the analysis. If a package is not present, the code tries to download and install it, so please make sure that you are connected to the Internet.
packages_vector <- c("tidyr", "gridExtra", "e1071", "MASS", "PerformanceAnalytics", "psych", "ggplot2", "GGally", "ggcorrplot", "Rtsne", "ggthemes", "rvest", "factoextra", "graphics", "corrplot", "mclust", "caret", "C50", "stats", "cluster", "matrixStats", "rpart", "rpart.plot", "RWeka", "randomForest", "neuralnet", "kernlab", "party", "class", "gbm", "ada", "TTR", "highcharter", "knitr", "kableExtra")
packages_to_install <- packages_vector[!(packages_vector %in% installed.packages()[,"Package"])]
if(length(packages_to_install)) install.packages(packages_to_install, repos = "http://cran.us.r-project.org")
# Machine-specific Java configuration (needed by RWeka on macOS); adjust the paths for your system
options("java.home"="/Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk/Contents/Home/lib")
Sys.setenv(LD_LIBRARY_PATH='$JAVA_HOME/server')
dyn.load('/Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk/Contents/Home/lib/server/libjvm.dylib')
library(tidyr)
library(gridExtra)
library(e1071)
library(MASS)
library(PerformanceAnalytics)
library(psych)
library(ggplot2)
library(GGally)
library(ggcorrplot)
library(Rtsne)
library(ggthemes)
library(rvest)
library(factoextra)
library(graphics)
library(corrplot)
library(mclust)
library(caret)
library(C50)
library(stats)
library(cluster)
library(matrixStats)
library(rpart)
library(rpart.plot)
library(RWeka)
library(randomForest)
library(neuralnet)
library(kernlab)
library(party)
library(class)
library(gbm)
library(ada)
library(highcharter)
library(knitr)
library(kableExtra)
Step 2: Loading the Diabetes Dataset and Printing its Properties
## While loading the dataset, the code converts the Outcome column to a factor with levels Negative (0) and Positive (1). It then prints the dimensions of the dataset (number of rows and columns), its structure, summary, header information and the distribution of the outcome.
df <- read.csv("/Users/omkarsunkersett/Downloads/diabetes.csv", header = TRUE, stringsAsFactors = FALSE)  # adjust the path to your local copy
df$Outcome <- as.factor(df$Outcome)
levels(df$Outcome) <- c("Negative","Positive")
dim(df)
## [1] 768 9
str(df)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : Factor w/ 2 levels "Negative","Positive": 2 1 2 1 2 1 2 1 2 2 ...
summary(df)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Negative:500
## Positive:268
##
##
##
##
head(df)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 Positive
## 2 0.351 31 Negative
## 3 0.672 32 Positive
## 4 0.167 21 Negative
## 5 2.288 33 Positive
## 6 0.201 30 Negative
prop.table(table(df$Outcome))
##
## Negative Positive
## 0.6510417 0.3489583
Step 3: Handling the missing values in the dataset
## The code below removes every row that contains a zero value, treating zeros as missing measurements. It replaces the zero values with NAs and drops those rows using the function drop_na(). It then prints the dimensions of the dataset along with its structure, summary, header information and the distribution of the outcome. We can observe a decrease in the number of rows by about 56% (768 to 336). Note that this step also discards rows where Pregnancies is zero, even though zero is a valid value for that attribute; a more conservative cleaning step would treat only physiologically impossible zeros (e.g., Glucose, Blood Pressure, BMI) as missing.
df[df == 0] <- NA
df <- df %>% drop_na()
dim(df)
## [1] 336 9
str(df)
## 'data.frame': 336 obs. of 9 variables:
## $ Pregnancies : int 1 3 2 1 5 1 1 3 11 10 ...
## $ Glucose : int 89 78 197 189 166 103 115 126 143 125 ...
## $ BloodPressure : int 66 50 70 60 72 30 70 88 94 70 ...
## $ SkinThickness : int 23 32 45 23 19 38 30 41 33 26 ...
## $ Insulin : int 94 88 543 846 175 83 96 235 146 115 ...
## $ BMI : num 28.1 31 30.5 30.1 25.8 43.3 34.6 39.3 36.6 31.1 ...
## $ DiabetesPedigreeFunction: num 0.167 0.248 0.158 0.398 0.587 0.183 0.529 0.704 0.254 0.205 ...
## $ Age : int 21 26 53 59 51 33 32 27 51 41 ...
## $ Outcome : Factor w/ 2 levels "Negative","Positive": 1 2 2 2 2 1 2 1 2 2 ...
summary(df)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 1.000 Min. : 56.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.:21.00
## Median : 3.000 Median :119.0 Median : 70.00 Median :28.50
## Mean : 3.851 Mean :122.3 Mean : 70.24 Mean :28.66
## 3rd Qu.: 6.000 3rd Qu.:144.0 3rd Qu.: 78.00 3rd Qu.:36.00
## Max. :17.000 Max. :197.0 Max. :110.00 Max. :52.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 15.0 Min. :18.20 Min. :0.0850 Min. :21.00
## 1st Qu.: 76.0 1st Qu.:27.80 1st Qu.:0.2680 1st Qu.:24.00
## Median :125.5 Median :32.75 Median :0.4465 Median :28.00
## Mean :155.3 Mean :32.30 Mean :0.5187 Mean :31.84
## 3rd Qu.:190.0 3rd Qu.:36.25 3rd Qu.:0.6883 3rd Qu.:38.00
## Max. :846.0 Max. :57.30 Max. :2.3290 Max. :81.00
## Outcome
## Negative:225
## Positive:111
##
##
##
##
head(df)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 4 1 89 66 23 94 28.1
## 7 3 78 50 32 88 31.0
## 9 2 197 70 45 543 30.5
## 14 1 189 60 23 846 30.1
## 15 5 166 72 19 175 25.8
## 19 1 103 30 38 83 43.3
## DiabetesPedigreeFunction Age Outcome
## 4 0.167 21 Negative
## 7 0.248 26 Positive
## 9 0.158 53 Positive
## 14 0.398 59 Positive
## 15 0.587 51 Positive
## 19 0.183 33 Negative
prop.table(table(df$Outcome))
##
## Negative Positive
## 0.6696429 0.3303571
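Dropping every row containing a zero discards more than half of the records. A common alternative (not used in this project) is to impute the implausible zeros with the column median instead. A minimal sketch, on a made-up five-row stand-in for the real data:

```r
# Replace zeros with the median of the non-zero values, column by column.
# The five-row frame here is hypothetical, not the real dataset.
demo <- data.frame(Glucose = c(148, 85, 0, 89, 137),
                   Insulin = c(0, 0, 94, 168, 0))
impute_zero <- function(x) { x[x == 0] <- median(x[x != 0]); x }
demo[] <- lapply(demo, impute_zero)
demo$Glucose  # the zero becomes 113, the median of 85, 89, 137, 148
demo$Insulin  # the zeros become 131, the median of 94 and 168
```

This retains all rows at the cost of flattening the imputed columns' distributions toward their medians.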
Step 4: Examining the histograms
## The code below generates the histograms for each variable of the dataset using different techniques. We observe that the Triceps Skin Fold Thickness and Diastolic Blood Pressure variables have a nearly normal distribution, whereas the remaining variables of the dataset are skewed to the right.
par(mfrow = c(2, 2))
hist(df$Pregnancies)
hist(df$Glucose)
hist(df$BloodPressure)
hist(df$SkinThickness)

hist(df$Insulin)
hist(df$BMI)
hist(df$DiabetesPedigreeFunction)
hist(df$Age)

ggplot(reshape2::melt(df), aes(x=value, fill=variable)) + geom_histogram(binwidth=5) + facet_wrap(~variable)

grid.arrange(ggplot(df, aes(x=df[,1])) + geom_density() + xlab("Pregnancies"), ggplot(df, aes(x=df[,1], col=Outcome)) + geom_density(alpha=0.4) + xlab("Pregnancies"), ncol=2, top=paste("Pregnancies", " [ Skew:",skewness(df[,1]),"]"))

grid.arrange(ggplot(df, aes(x=df[,2])) + geom_density() + xlab("Glucose"), ggplot(df, aes(x=df[,2], col=Outcome)) + geom_density(alpha=0.4) + xlab("Glucose"), ncol=2, top=paste("Glucose", " [ Skew:",skewness(df[,2]),"]"))

grid.arrange(ggplot(df, aes(x=df[,3])) + geom_density() + xlab("Blood Pressure"), ggplot(df, aes(x=df[,3], col=Outcome)) + geom_density(alpha=0.4) + xlab("Blood Pressure"), ncol=2, top=paste("Blood Pressure", " [ Skew:",skewness(df[,3]),"]"))

grid.arrange(ggplot(df, aes(x=df[,4])) + geom_density() + xlab("Skin Thickness"), ggplot(df, aes(x=df[,4], col=Outcome)) + geom_density(alpha=0.4) + xlab("Skin Thickness"), ncol=2, top=paste("Skin Thickness", " [ Skew:",skewness(df[,4]),"]"))

grid.arrange(ggplot(df, aes(x=df[,5])) + geom_density() + xlab("Insulin"), ggplot(df, aes(x=df[,5], col=Outcome)) + geom_density(alpha=0.4) + xlab("Insulin"), ncol=2, top=paste("Insulin", " [ Skew:",skewness(df[,5]),"]"))

grid.arrange(ggplot(df, aes(x=df[,6])) + geom_density() + xlab("Body Mass Index"), ggplot(df, aes(x=df[,6], col=Outcome)) + geom_density(alpha=0.4) + xlab("Body Mass Index"), ncol=2, top=paste("Body Mass Index", " [ Skew:",skewness(df[,6]),"]"))

grid.arrange(ggplot(df, aes(x=df[,7])) + geom_density() + xlab("Diabetes Pedigree Function"), ggplot(df, aes(x=df[,7], col=Outcome)) + geom_density(alpha=0.4) + xlab("Diabetes Pedigree Function"), ncol=2, top=paste("Diabetes Pedigree Function", " [ Skew:",skewness(df[,7]),"]"))

grid.arrange(ggplot(df, aes(x=df[,8])) + geom_density() + xlab("Age"), ggplot(df, aes(x=df[,8], col=Outcome)) + geom_density(alpha=0.4) + xlab("Age"), ncol=2, top=paste("Age", " [ Skew:",skewness(df[,8]),"]"))
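The per-variable skewness values shown in the plot titles can also be tabulated in a single pass with sapply(). The sketch below uses a hand-rolled moment-based skewness (e1071::skewness computes a close variant) on a synthetic two-column frame, since the diabetes file path used above is machine-specific:

```r
# Moment-based sample skewness (e1071::skewness computes a close variant)
skew <- function(x) mean((x - mean(x))^3) / (mean((x - mean(x))^2))^1.5

# Synthetic stand-in for the data: one right-skewed, one roughly symmetric column
set.seed(1)
demo <- data.frame(Glucose = rlnorm(500, meanlog = 4.7, sdlog = 0.3),
                   BloodPressure = rnorm(500, mean = 70, sd = 12))
round(sapply(demo, skew), 2)  # positive value = right skew, near zero = symmetric
```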

Step 5: Examining some Boxplots and Density Curves
## Figure 1 is a boxplot of the Diabetes Pedigree Function for each Test Result (Positive or Negative). The boxplot indicates that the median value of the pedigree function is higher for the tests that are positive. The inter-quartile range for this function is slightly greater for the tests that are positive.
## Figure 2 is the density curve for the variable Plasma Glucose for both outcomes (positive or negative). The density curve of the negative outcome has a higher peak value than that of the positive outcome. Notice how these density curves are skewed oppositely.
par(mfrow = c(1, 2))
boxplot(DiabetesPedigreeFunction ~ Outcome, data = df, ylab = "Diabetes Pedigree Function", xlab = "Test Results", main = "Figure 1", outline = FALSE)
positive <- subset(df, df$Outcome=='Positive')
negative <- subset(df, df$Outcome=='Negative')
plot(density(positive$Glucose), xlim = c(0, 250), ylim = c(0.00, 0.02), xlab = "Plasma Glucose", main = "Figure 2", col = "red", lwd = 2)
lines(density(negative$Glucose), col = "black", lwd = 2)
legend("topleft", col = c("red", "black"), legend = c("Positive", "Negative"), lwd = 2, bty = "n")

Step 6: Examining the Correlation of Variance for the Dataset
## The figures below depict the pairwise correlations between variables using both chart.Correlation() and pairs.panels(). We observe that the correlation is high between the variable pairs Pregnancies & Age, Skin Thickness & BMI, and Glucose & Insulin.
chart.Correlation(df[,-9], histogram=TRUE, col="grey10", pch=1, main="Chart.Correlation of Variance")

pairs.panels(df[,-9], method="pearson", hist.col = "#1fbbfa", density=TRUE, show.points=TRUE, pch=1, lm=TRUE, cex.cor=1, smoother=FALSE, stars=TRUE, main="Pairs.Panels of Variance")
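
The high-correlation pairs noted above can also be extracted programmatically rather than read off the chart. This sketch runs on a small synthetic frame with one deliberately correlated pair; the same two lines starting at cor() apply unchanged to df[,-9]:

```r
# List variable pairs whose absolute Pearson correlation exceeds a cutoff.
# Synthetic illustration: Pregnancies is constructed to track Age.
set.seed(42)
age <- rnorm(200, mean = 33, sd = 10)
demo <- data.frame(Age = age,
                   Pregnancies = round(pmax(0, age/8 + rnorm(200))),  # tied to Age
                   Glucose = rnorm(200, mean = 120, sd = 30))         # independent
cm <- cor(demo)
idx <- which(abs(cm) > 0.6 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           r = round(cm[idx], 2))  # only Age & Pregnancies should appear
```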

Step 7: Performing Step 6 using other Functions
## The figures below depict the pairwise correlations using corrplot(), ggpairs(), ggcorr() and ggcorrplot(). Again, the correlation is high between the variable pairs Pregnancies & Age, Skin Thickness & BMI, and Glucose & Insulin.
corrplot(cor(df[,-9]))

corrplot(cor(df[,-9]), method = "number", type = "upper", title = "\nCorrelation Plot of Variance", bg = 0xFF0000, addgrid.col = "darkgray")

ggpairs(df, aes(color=Outcome, alpha=0.80), lower=list(continuous="smooth")) + theme_bw() + labs(title="Correlation Plot of Variance (wrt. Outcome)") + theme(plot.title=element_text(face='bold',color='black',hjust=0.5,size=12))

ggcorr(df[,-9], name = "corr", label = TRUE) + theme(legend.position="none") + labs(title="Correlation Plot of Variance (figure 2)") + theme(plot.title=element_text(face='bold',color='black',hjust=0.5,size=12))

ggcorrplot(round(cor(df[,-9]), 1), hc.order = TRUE, type = "lower", lab = TRUE, lab_size = 3, method="circle", colors = c("red", "green", "blue"), title="Correlation Plot of Variance (figure 3)", ggtheme=theme_bw)

Step 8: Performing some Exploratory Analysis for Age Groups
## Generating some scatterplots, lined bar plots, stacked bar plots, box plots and classification pair plots for the age groups.
ggplot(data=df, aes(Glucose, Pregnancies)) + geom_jitter(aes(colour = Outcome))

ggplot(data=df, aes(Glucose,fill= Outcome)) + geom_bar(color = "black", width = 1) + xlab("Plasma Glucose") + ylab("Number of People") + theme(axis.text.x=element_text(angle=75, hjust=1)) + ggtitle("Plasma Glucose and Test Results")

df_ag <- df
df_ag$AgeGroup <- cut(df_ag$Age, breaks = c(20,35,50,100), labels = FALSE)  # integer group codes 1-3
ggplot(data=df_ag, aes(AgeGroup, fill = Outcome),y = (..count..)/sum(..count..)) + geom_bar(color = "black", width = 0.7) + xlab("AgeGroup 20-35, 35-50, 50-100") + ylab("Number of People") + theme(axis.text.x=element_text(angle=75, hjust=1)) + ggtitle("Age Group and Test Results") + stat_bin(geom = "text",aes(label = paste(round((..count..)/sum(..count..)*100), "%")),vjust = 2)

df_ag$AgeGroup <- as.factor(df_ag$AgeGroup)
ggplot(data=df_ag, aes(Pregnancies, fill = AgeGroup),y = (..count..)/sum(..count..)) + geom_bar(color = "black", width = 0.7) + xlab("Pregnancies") + ylab("Number of People") + theme(axis.text.x=element_text(angle=75, hjust=1)) + ggtitle("Age Group and Pregnancies")

myplot <- function(x,y) {
ggplot(data = df_ag, aes(eval(parse(text = x)), eval(parse(text = y))))+geom_boxplot(outlier.colour = "blue") + xlab(x) + ylab(y) + geom_jitter(alpha=0.2, aes(colour = Outcome))
}
p1 <- myplot("AgeGroup","SkinThickness")
p2 <- myplot("AgeGroup","Pregnancies")
p3 <- myplot("AgeGroup","Glucose")
p4 <- myplot("AgeGroup","BloodPressure")
p5 <- myplot("AgeGroup","BMI")
p6 <- myplot("AgeGroup","Insulin")
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3)

clp <- clPairs(df[,-9], classification = df$Outcome, lower.panel = NULL)
clPairsLegend(0.1, 0.4, class = clp$class, col = clp$col, pch = clp$pch, title = "Classification Pairs Plot")

Step 9: Performing Dimensionality Reduction on the Variables
## Using techniques such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Principal Component Analysis (PCA) to perform dimensionality reduction on the variables of the dataset.
tsne <- function(perplexity) {
  result <- Rtsne(df[,-9], perplexity = perplexity, pca = TRUE, check_duplicates = FALSE)
  result$Y
}
x <- c(10, 20, 30, 40, 50)
df_tsne <- data.frame(tsne(x[1]), tsne(x[2]), tsne(x[3]), tsne(x[4]), tsne(x[5]), class = df$Outcome)
xs <- c(1,3,5,7,9)
ys <- c(2,4,6,8,10)
ggplot(df_tsne,aes(x=(df_tsne[,xs[1]]),y=(df_tsne[,ys[1]]),color=class)) + geom_point(size=1.5, alpha=0.6) + labs(x="", y="") + theme(axis.text.x=element_blank(), axis.text.y=element_blank()) + ggtitle(paste("Perplexity:", x[1])) + scale_color_tableau()

ggplot(df_tsne,aes(x=(df_tsne[,xs[2]]),y=(df_tsne[,ys[2]]),color=class)) + geom_point(size=1.5, alpha=0.6) + labs(x="", y="") + theme(axis.text.x=element_blank(), axis.text.y=element_blank()) + ggtitle(paste("Perplexity:", x[2])) + scale_color_tableau()

ggplot(df_tsne,aes(x=(df_tsne[,xs[3]]),y=(df_tsne[,ys[3]]),color=class)) + geom_point(size=1.5, alpha=0.6) + labs(x="", y="") + theme(axis.text.x=element_blank(), axis.text.y=element_blank()) + ggtitle(paste("Perplexity:", x[3])) + scale_color_tableau()

ggplot(df_tsne,aes(x=(df_tsne[,xs[4]]),y=(df_tsne[,ys[4]]),color=class)) + geom_point(size=1.5, alpha=0.6) + labs(x="", y="") + theme(axis.text.x=element_blank(), axis.text.y=element_blank()) + ggtitle(paste("Perplexity:", x[4])) + scale_color_tableau()

ggplot(df_tsne,aes(x=(df_tsne[,xs[5]]),y=(df_tsne[,ys[5]]),color=class)) + geom_point(size=1.5, alpha=0.6) + labs(x="", y="") + theme(axis.text.x=element_blank(), axis.text.y=element_blank()) + ggtitle(paste("Perplexity:", x[5])) + scale_color_tableau()

mu <- colMeans(df[,-9])
df.center <- as.matrix(sweep(df[,-9], 2, mu))  # center each column by its mean
S <- cov(df.center)
eigen(S)
## eigen() decomposition
## $values
## [1] 1.446559e+04 6.298215e+02 1.716000e+02 1.024493e+02 7.995919e+01
## [6] 1.901797e+01 5.069121e+00 1.025501e-01
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.0029050033 0.0366779083 -0.093374911 0.03830857 0.174415051
## [2,] -0.1572105456 0.9611908429 0.212947338 -0.03438768 -0.069676519
## [3,] -0.0113281672 0.1518849310 -0.774598045 0.40899636 -0.450862292
## [4,] -0.0177168282 0.0593180072 -0.396029981 -0.81455047 0.034211700
## [5,] -0.9870008329 -0.1596748873 -0.006298551 0.01567225 -0.003049791
## [6,] -0.0132746353 0.0252155158 -0.221791407 -0.35674876 -0.089228226
## [7,] -0.0004917723 0.0001696004 -0.001014924 -0.00303782 0.001045665
## [8,] -0.0220701158 0.1484833852 -0.373979343 0.19762317 0.867355322
## [,6] [,7] [,8]
## [1,] 0.023166040 0.978514939 3.333198e-03
## [2,] 0.002645101 -0.002471398 -6.860792e-05
## [3,] -0.077870570 -0.013450894 9.616938e-04
## [4,] -0.417899796 -0.004374280 -2.335738e-03
## [5,] -0.005673450 0.002519608 -3.964990e-04
## [6,] 0.902536697 -0.013636420 -2.595313e-03
## [7,] 0.001462175 -0.003613076 9.999866e-01
## [8,] 0.064385229 -0.205175511 -1.557695e-03
pca1 <- prcomp(as.matrix(df[,-9]), center = TRUE)
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 120.2730 25.0962 13.09962 10.12172 8.94199 4.36096
## Proportion of Variance 0.9349 0.0407 0.01109 0.00662 0.00517 0.00123
## Cumulative Proportion 0.9349 0.9756 0.98665 0.99327 0.99844 0.99967
## PC7 PC8
## Standard deviation 2.25147 0.32023
## Proportion of Variance 0.00033 0.00001
## Cumulative Proportion 0.99999 1.00000
eigen <- get_eigenvalue(pca1)
eigen
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 1.446559e+04 9.348556e+01 93.48556
## Dim.2 6.298215e+02 4.070295e+00 97.55585
## Dim.3 1.716000e+02 1.108985e+00 98.66484
## Dim.4 1.024493e+02 6.620907e-01 99.32693
## Dim.5 7.995919e+01 5.167456e-01 99.84367
## Dim.6 1.901797e+01 1.229058e-01 99.96658
## Dim.7 5.069121e+00 3.275979e-02 99.99934
## Dim.8 1.025501e-01 6.627417e-04 100.00000
plot(pca1)

qualit_vars <- as.factor(df$Outcome)
biplot(pca1, choices = 1:2, scale = 1, pc.biplot = FALSE)

fviz_pca_biplot(pca1, axes = c(1, 2), geom = c("point", "text"), col.ind = "black", col.var = "steelblue", label = "all", invisible = "none", repel = T, habillage = qualit_vars, palette = NULL, addEllipses = TRUE, title = "PCA - Biplot")
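
The cumulative-variance column above suggests how many components to keep, and that choice can be automated. A sketch on synthetic data with one dominant direction (the 0.99 threshold is an arbitrary illustration, not a rule):

```r
# Keep the smallest number of PCs reaching 99% cumulative explained variance.
set.seed(7)
X <- matrix(rnorm(300 * 4), ncol = 4)
X[, 2] <- 3 * X[, 1] + rnorm(300, sd = 0.1)  # make one direction dominant
pca <- prcomp(X, center = TRUE)
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cumvar >= 0.99)[1]
k  # number of components to retain
## [1] 3
```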

Step 10: Dividing the Original Dataset into Training and Testing Sets
## The training dataset contains 70% of the rows from the original dataset, whereas the testing dataset contains the remaining 30%. It is important to check the dimension and distribution of the resulting datasets.
set.seed(12345)
ckpt <- sample(1:nrow(df), floor(0.70*nrow(df)))
train <- df[ckpt,]
test <- df[-ckpt,]
dim(train)
## [1] 235 9
dim(test)
## [1] 101 9
prop.table(table(train$Outcome))
##
## Negative Positive
## 0.6893617 0.3106383
prop.table(table(test$Outcome))
##
## Negative Positive
## 0.6237624 0.3762376
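The split above is a simple random sample, so the class proportions in the training set (69% negative) and testing set (62% negative) drift apart. A stratified split keeps the balance essentially identical in both sets. A package-free sketch (caret::createDataPartition offers the same behavior), using a hypothetical outcome vector with the same 225/111 class counts as df:

```r
# Stratified 70/30 split: sample 70% of the rows within each class separately.
set.seed(12345)
outcome <- factor(rep(c("Negative", "Positive"), times = c(225, 111)))  # as in df
train_idx <- unlist(lapply(levels(outcome), function(lv) {
  rows <- which(outcome == lv)
  sample(rows, floor(0.70 * length(rows)))
}))
round(prop.table(table(outcome[train_idx])), 3)  # matches the overall 0.67/0.33 split
```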
Step 11: Training and Testing a Set of Models using Supervised and Unsupervised Machine Learning Techniques
## IMPORTANT NOTE: I have selected all of the attributes (except Outcome) as features for all of my predictive models, because every attribute has some effect on the Outcome (the test result for diabetes mellitus). Feature selection therefore amounts to selecting all attributes of the dataset. I use the confusion matrix to calculate the accuracy, sensitivity and specificity of each model; these three metrics are used to evaluate the results.
Using the C5.0 Classification Model --
set.seed(1234)
c5_model <- C5.0(train[,-9], train$Outcome)
c5_model
##
## Call:
## C5.0.default(x = train[, -9], y = train$Outcome)
##
## Classification Tree
## Number of samples: 235
## Number of predictors: 8
##
## Tree size: 16
##
## Non-standard options: attempt to group attributes
summary(c5_model)
##
## Call:
## C5.0.default(x = train[, -9], y = train$Outcome)
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Apr 20 20:29:44 2018
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 235 cases (9 attributes) from undefined.data
##
## Decision tree:
##
## Glucose <= 127:
## :...DiabetesPedigreeFunction <= 0.673: Negative (107/5)
## : DiabetesPedigreeFunction > 0.673:
## : :...Age <= 43: Negative (33/8)
## : Age > 43: Positive (5)
## Glucose > 127:
## :...Age <= 24:
## :...BMI <= 38.7: Negative (13)
## : BMI > 38.7:
## : :...Insulin <= 335: Positive (2)
## : Insulin > 335: Negative (2)
## Age > 24:
## :...Glucose > 154:
## :...Insulin <= 83: Negative (2)
## : Insulin > 83: Positive (31/2)
## Glucose <= 154:
## :...Glucose > 152: Negative (5)
## Glucose <= 152:
## :...Age > 55: Negative (3)
## Age <= 55:
## :...BloodPressure > 76: Positive (18/1)
## BloodPressure <= 76:
## :...DiabetesPedigreeFunction > 0.598: Negative (3)
## DiabetesPedigreeFunction <= 0.598:
## :...DiabetesPedigreeFunction > 0.415: Positive (5)
## DiabetesPedigreeFunction <= 0.415:
## :...BloodPressure <= 68: Negative (2)
## BloodPressure > 68:
## :...BloodPressure <= 70: Positive (2)
## BloodPressure > 70: Negative (2)
##
##
## Evaluation on training data (235 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 16 16( 6.8%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 159 3 (a): class Negative
## 13 60 (b): class Positive
##
##
## Attribute usage:
##
## 100.00% Glucose
## 67.66% DiabetesPedigreeFunction
## 54.47% Age
## 15.74% Insulin
## 13.62% BloodPressure
## 7.23% BMI
##
##
## Time: 0.0 secs
plot(c5_model, subtree = 2)

c5_pred <- predict(c5_model, test[,-9])
cm_c5_orig <- confusionMatrix(table(c5_pred, test$Outcome))
cm_c5_orig
## Confusion Matrix and Statistics
##
##
## c5_pred Negative Positive
## Negative 55 21
## Positive 8 17
##
## Accuracy : 0.7129
## 95% CI : (0.6143, 0.7985)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.03858
##
## Kappa : 0.3437
## Mcnemar's Test P-Value : 0.02586
##
## Sensitivity : 0.8730
## Specificity : 0.4474
## Pos Pred Value : 0.7237
## Neg Pred Value : 0.6800
## Prevalence : 0.6238
## Detection Rate : 0.5446
## Detection Prevalence : 0.7525
## Balanced Accuracy : 0.6602
##
## 'Positive' Class : Negative
##
set.seed(1234)
c5_boost <- C5.0(train[,-9], train$Outcome, trials = 6)
c5_boost
##
## Call:
## C5.0.default(x = train[, -9], y = train$Outcome, trials = 6)
##
## Classification Tree
## Number of samples: 235
## Number of predictors: 8
##
## Number of boosting iterations: 6
## Average tree size: 13.3
##
## Non-standard options: attempt to group attributes
summary(c5_boost)
##
## Call:
## C5.0.default(x = train[, -9], y = train$Outcome, trials = 6)
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Apr 20 20:29:46 2018
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 235 cases (9 attributes) from undefined.data
##
## ----- Trial 0: -----
##
## Decision tree:
##
## Glucose <= 127:
## :...DiabetesPedigreeFunction <= 0.673: Negative (107/5)
## : DiabetesPedigreeFunction > 0.673:
## : :...Age <= 43: Negative (33/8)
## : Age > 43: Positive (5)
## Glucose > 127:
## :...Age <= 24:
## :...BMI <= 38.7: Negative (13)
## : BMI > 38.7:
## : :...Insulin <= 335: Positive (2)
## : Insulin > 335: Negative (2)
## Age > 24:
## :...Glucose > 154:
## :...Insulin <= 83: Negative (2)
## : Insulin > 83: Positive (31/2)
## Glucose <= 154:
## :...Glucose > 152: Negative (5)
## Glucose <= 152:
## :...Age > 55: Negative (3)
## Age <= 55:
## :...BloodPressure > 76: Positive (18/1)
## BloodPressure <= 76:
## :...DiabetesPedigreeFunction > 0.598: Negative (3)
## DiabetesPedigreeFunction <= 0.598:
## :...DiabetesPedigreeFunction > 0.415: Positive (5)
## DiabetesPedigreeFunction <= 0.415:
## :...BloodPressure <= 68: Negative (2)
## BloodPressure > 68:
## :...BloodPressure <= 70: Positive (2)
## BloodPressure > 70: Negative (2)
##
## ----- Trial 1: -----
##
## Decision tree:
##
## BMI <= 26.3: Negative (40.7/2.3)
## BMI > 26.3:
## :...Insulin <= 68: Negative (25.4/1.5)
## Insulin > 68:
## :...Age <= 22: Negative (12.3)
## Age > 22:
## :...DiabetesPedigreeFunction > 0.528: Positive (78.3/16.1)
## DiabetesPedigreeFunction <= 0.528:
## :...BMI > 43.5: Positive (9.9)
## BMI <= 43.5:
## :...Pregnancies > 10: Positive (6.5)
## Pregnancies <= 10:
## :...Glucose <= 81: Positive (4.2)
## Glucose > 81:
## :...Glucose <= 127: Negative (23)
## Glucose > 127:
## :...DiabetesPedigreeFunction <= 0.306: Negative (22.5/4.6)
## DiabetesPedigreeFunction > 0.306: Positive (12.3/3.1)
##
## ----- Trial 2: -----
##
## Decision tree:
##
## Glucose > 157:
## :...SkinThickness <= 17: Negative (5.1/0.6)
## : SkinThickness > 17: Positive (26.1/1.2)
## Glucose <= 157:
## :...BMI <= 26.3: Negative (29.5)
## BMI > 26.3:
## :...Age <= 30:
## :...SkinThickness <= 30: Negative (51.1/3.8)
## : SkinThickness > 30:
## : :...DiabetesPedigreeFunction > 0.893: Positive (10.4/0.6)
## : DiabetesPedigreeFunction <= 0.893:
## : :...Glucose > 119: Negative (25.4/2.4)
## : Glucose <= 119:
## : :...Insulin <= 82: Negative (4.8)
## : Insulin > 82: Positive (16.1/3)
## Age > 30:
## :...Pregnancies <= 1: Positive (12.4/0.6)
## Pregnancies > 1:
## :...Glucose <= 90: Negative (5)
## Glucose > 90:
## :...SkinThickness <= 26: Positive (14.6/1.2)
## SkinThickness > 26:
## :...BloodPressure <= 74: Negative (17.5/1.8)
## BloodPressure > 74: Positive (16.9/4.4)
##
## ----- Trial 3: -----
##
## Decision tree:
##
## Age <= 22: Negative (19.6)
## Age > 22:
## :...Insulin <= 87:
## :...DiabetesPedigreeFunction <= 1.268: Negative (39.5/2)
## : DiabetesPedigreeFunction > 1.268: Positive (4.1)
## Insulin > 87:
## :...Glucose > 154: Positive (35.1/6.4)
## Glucose <= 154:
## :...Glucose > 152: Negative (12.8)
## Glucose <= 152:
## :...BMI > 39.7: Positive (15.7/0.9)
## BMI <= 39.7:
## :...Pregnancies > 10: Positive (7.2/0.5)
## Pregnancies <= 10:
## :...BloodPressure <= 50: Positive (8.5/0.9)
## BloodPressure > 50: Negative (92.5/31.8)
##
## ----- Trial 4: -----
##
## Decision tree:
##
## BMI <= 25: Negative (14.6)
## BMI > 25:
## :...Age <= 24: Negative (42/8)
## Age > 24:
## :...Insulin <= 86: Negative (29.2/7.7)
## Insulin > 86:
## :...DiabetesPedigreeFunction <= 0.229: Negative (19.9/4.4)
## DiabetesPedigreeFunction > 0.229:
## :...SkinThickness > 45: Positive (14.7)
## SkinThickness <= 45:
## :...Insulin <= 100: Positive (13.7/0.4)
## Insulin > 100:
## :...Insulin > 155:
## :...SkinThickness <= 41: Positive (51.9/10)
## : SkinThickness > 41: Negative (3.8)
## Insulin <= 155:
## :...Glucose <= 124: Negative (10.4/0.4)
## Glucose > 124:
## :...Pregnancies <= 2: Positive (7.9)
## Pregnancies > 2:
## :...SkinThickness <= 28: Positive (15/5.3)
## SkinThickness > 28: Negative (11.8/1.5)
##
## ----- Trial 5: -----
##
## Decision tree:
##
## Age <= 22: Negative (12.6)
## Age > 22:
## :...BloodPressure > 88: Positive (14.3/0.6)
## BloodPressure <= 88:
## :...Insulin <= 68: Negative (15)
## Insulin > 68:
## :...SkinThickness <= 30:
## :...Age <= 26: Negative (25/2.4)
## : Age > 26:
## : :...Glucose > 181: Negative (5.3/0.3)
## : Glucose <= 181:
## : :...SkinThickness > 25: Negative (22.2/4.7)
## : SkinThickness <= 25:
## : :...BloodPressure > 82: Negative (2.6)
## : BloodPressure <= 82:
## : :...BMI <= 25.4: Negative (3.5)
## : BMI > 25.4: Positive (27.8/2.8)
## SkinThickness > 30:
## :...Glucose > 157: Positive (14)
## Glucose <= 157:
## :...Glucose > 152: Negative (10.3)
## Glucose <= 152:
## :...BloodPressure > 78: Positive (22.2/1.8)
## BloodPressure <= 78:
## :...Age > 57: Negative (4.1)
## Age <= 57:
## :...Pregnancies > 7: Positive (4.8)
## Pregnancies <= 7: [S1]
##
## SubTree [S1]
##
## DiabetesPedigreeFunction <= 0.332: Negative (11.6/1.6)
## DiabetesPedigreeFunction > 0.332:
## :...Age > 42: Positive (5.6)
## Age <= 42:
## :...Age > 32: Negative (3.6)
## Age <= 32:
## :...SkinThickness > 45: Negative (2.5)
## SkinThickness <= 45:
## :...BMI <= 41.3: Positive (24.7/4.6)
## BMI > 41.3: Negative (3.3/0.3)
##
##
## Evaluation on training data (235 cases):
##
## Trial Decision Tree
## ----- ----------------
## Size Errors
##
## 0 16 16( 6.8%)
## 1 10 36(15.3%)
## 2 13 28(11.9%)
## 3 9 32(13.6%)
## 4 12 30(12.8%)
## 5 20 28(11.9%)
## boost 1( 0.4%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 162 (a): class Negative
## 1 72 (b): class Positive
##
##
## Attribute usage:
##
## 100.00% Glucose
## 100.00% BMI
## 100.00% Age
## 96.60% DiabetesPedigreeFunction
## 92.77% Insulin
## 85.96% SkinThickness
## 83.83% BloodPressure
## 60.00% Pregnancies
##
##
## Time: 0.0 secs
plot(c5_boost, subtree = 2)

c5_boost_pred <- predict(c5_boost, test[,-9])
cm_c5_boost <- confusionMatrix(table(c5_boost_pred, test$Outcome))
cm_c5_boost
## Confusion Matrix and Statistics
##
##
## c5_boost_pred Negative Positive
## Negative 52 12
## Positive 11 26
##
## Accuracy : 0.7723
## 95% CI : (0.6782, 0.8498)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.001057
##
## Kappa : 0.5123
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.8254
## Specificity : 0.6842
## Pos Pred Value : 0.8125
## Neg Pred Value : 0.7027
## Prevalence : 0.6238
## Detection Rate : 0.5149
## Detection Prevalence : 0.6337
## Balanced Accuracy : 0.7548
##
## 'Positive' Class : Negative
##
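As a sanity check, the headline statistics above can be recomputed directly from the 2x2 table (note caret is treating "Negative" as the positive class here, per the last line of the output):

```r
# Recompute caret's headline statistics from the boosted C5.0 confusion matrix.
# Rows = predictions, columns = reference labels (Negative, Positive).
tab <- matrix(c(52, 11, 12, 26), nrow = 2,
              dimnames = list(pred = c("Negative", "Positive"),
                              ref  = c("Negative", "Positive")))
n <- sum(tab)
accuracy    <- sum(diag(tab)) / n                                    # (52+26)/101 = 0.7723
sensitivity <- tab["Negative", "Negative"] / sum(tab[, "Negative"])  # 52/63 = 0.8254
specificity <- tab["Positive", "Positive"] / sum(tab[, "Positive"])  # 26/38 = 0.6842
# Cohen's kappa: observed agreement corrected for chance agreement.
p_chance <- sum(rowSums(tab) * colSums(tab)) / n^2
kappa    <- (accuracy - p_chance) / (1 - p_chance)                   # 0.5123
round(c(accuracy, sensitivity, specificity, kappa), 4)
```

All four values match the confusionMatrix() output above.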
Using the Recursive Partitioning (rpart) Model –
set.seed(1234)
rp_model <- rpart(Outcome~., data=train, cp=0.01)
rp_model
## n= 235
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 235 73 Negative (0.68936170 0.31063830)
## 2) Glucose< 127.5 145 18 Negative (0.87586207 0.12413793)
## 4) DiabetesPedigreeFunction< 0.6735 107 5 Negative (0.95327103 0.04672897) *
## 5) DiabetesPedigreeFunction>=0.6735 38 13 Negative (0.65789474 0.34210526)
## 10) Age< 40 31 7 Negative (0.77419355 0.22580645) *
## 11) Age>=40 7 1 Positive (0.14285714 0.85714286) *
## 3) Glucose>=127.5 90 35 Positive (0.38888889 0.61111111)
## 6) Age< 24.5 17 2 Negative (0.88235294 0.11764706) *
## 7) Age>=24.5 73 20 Positive (0.27397260 0.72602740)
## 14) Glucose< 154.5 40 16 Positive (0.40000000 0.60000000)
## 28) BloodPressure< 77 18 7 Negative (0.61111111 0.38888889) *
## 29) BloodPressure>=77 22 5 Positive (0.22727273 0.77272727) *
## 15) Glucose>=154.5 33 4 Positive (0.12121212 0.87878788) *
summary(rp_model)
## Call:
## rpart(formula = Outcome ~ ., data = train, cp = 0.01)
## n= 235
##
## CP nsplit rel error xerror xstd
## 1 0.27397260 0 1.0000000 1.0000000 0.09717670
## 2 0.17808219 1 0.7260274 0.8219178 0.09156676
## 3 0.03424658 2 0.5479452 0.5479452 0.07892062
## 4 0.02739726 4 0.4794521 0.6438356 0.08399841
## 5 0.01000000 6 0.4246575 0.7260274 0.08776409
##
## Variable importance
## Glucose Age Pregnancies
## 32 23 12
## Insulin BloodPressure SkinThickness
## 11 9 6
## DiabetesPedigreeFunction BMI
## 5 1
##
## Node number 1: 235 observations, complexity param=0.2739726
## predicted class=Negative expected loss=0.3106383 P(node) =1
## class counts: 162 73
## probabilities: 0.689 0.311
## left son=2 (145 obs) right son=3 (90 obs)
## Primary splits:
## Glucose < 127.5 to the left, improve=26.338000, (0 missing)
## Age < 28.5 to the left, improve=21.612980, (0 missing)
## Insulin < 119.5 to the left, improve=17.115850, (0 missing)
## Pregnancies < 6.5 to the left, improve=12.205670, (0 missing)
## BMI < 26.45 to the left, improve= 8.832585, (0 missing)
## Surrogate splits:
## Insulin < 123.5 to the left, agree=0.728, adj=0.289, (0 split)
## Age < 33.5 to the left, agree=0.694, adj=0.200, (0 split)
## Pregnancies < 6.5 to the left, agree=0.689, adj=0.189, (0 split)
## BloodPressure < 77 to the left, agree=0.668, adj=0.133, (0 split)
## SkinThickness < 32.5 to the left, agree=0.660, adj=0.111, (0 split)
##
## Node number 2: 145 observations, complexity param=0.03424658
## predicted class=Negative expected loss=0.1241379 P(node) =0.6170213
## class counts: 127 18
## probabilities: 0.876 0.124
## left son=4 (107 obs) right son=5 (38 obs)
## Primary splits:
## DiabetesPedigreeFunction < 0.6735 to the left, improve=4.893061, (0 missing)
## Age < 43.5 to the left, improve=3.696448, (0 missing)
## Insulin < 145 to the left, improve=3.336229, (0 missing)
## BMI < 40.7 to the left, improve=3.034738, (0 missing)
## Pregnancies < 7.5 to the left, improve=1.968943, (0 missing)
## Surrogate splits:
## Insulin < 203.5 to the left, agree=0.759, adj=0.079, (0 split)
## BMI < 40.1 to the left, agree=0.759, adj=0.079, (0 split)
## SkinThickness < 9 to the right, agree=0.752, adj=0.053, (0 split)
##
## Node number 3: 90 observations, complexity param=0.1780822
## predicted class=Positive expected loss=0.3888889 P(node) =0.3829787
## class counts: 35 55
## probabilities: 0.389 0.611
## left son=6 (17 obs) right son=7 (73 obs)
## Primary splits:
## Age < 24.5 to the left, improve=10.207270, (0 missing)
## Glucose < 154.5 to the left, improve= 5.011332, (0 missing)
## BMI < 29.5 to the left, improve= 4.283294, (0 missing)
## Pregnancies < 1.5 to the left, improve= 2.793894, (0 missing)
## SkinThickness < 22.5 to the left, improve= 2.793894, (0 missing)
## Surrogate splits:
## Pregnancies < 1.5 to the left, agree=0.867, adj=0.294, (0 split)
## BloodPressure < 49 to the left, agree=0.833, adj=0.118, (0 split)
## SkinThickness < 13.5 to the left, agree=0.833, adj=0.118, (0 split)
##
## Node number 4: 107 observations
## predicted class=Negative expected loss=0.04672897 P(node) =0.4553191
## class counts: 102 5
## probabilities: 0.953 0.047
##
## Node number 5: 38 observations, complexity param=0.03424658
## predicted class=Negative expected loss=0.3421053 P(node) =0.1617021
## class counts: 25 13
## probabilities: 0.658 0.342
## left son=10 (31 obs) right son=11 (7 obs)
## Primary splits:
## Age < 40 to the left, improve=4.552268, (0 missing)
## Pregnancies < 3.5 to the left, improve=3.296568, (0 missing)
## BloodPressure < 75 to the left, improve=2.377153, (0 missing)
## Insulin < 140 to the left, improve=2.158484, (0 missing)
## Glucose < 110.5 to the left, improve=1.523725, (0 missing)
## Surrogate splits:
## Pregnancies < 7.5 to the left, agree=0.868, adj=0.286, (0 split)
##
## Node number 6: 17 observations
## predicted class=Negative expected loss=0.1176471 P(node) =0.07234043
## class counts: 15 2
## probabilities: 0.882 0.118
##
## Node number 7: 73 observations, complexity param=0.02739726
## predicted class=Positive expected loss=0.2739726 P(node) =0.3106383
## class counts: 20 53
## probabilities: 0.274 0.726
## left son=14 (40 obs) right son=15 (33 obs)
## Primary splits:
## Glucose < 154.5 to the left, improve=2.810793, (0 missing)
## Insulin < 142 to the left, improve=2.586832, (0 missing)
## DiabetesPedigreeFunction < 0.3425 to the left, improve=1.525798, (0 missing)
## BMI < 29.5 to the left, improve=1.402015, (0 missing)
## SkinThickness < 44 to the left, improve=1.162308, (0 missing)
## Surrogate splits:
## Insulin < 238.5 to the left, agree=0.685, adj=0.303, (0 split)
## BloodPressure < 71 to the right, agree=0.658, adj=0.242, (0 split)
## Pregnancies < 3.5 to the right, agree=0.630, adj=0.182, (0 split)
## SkinThickness < 20 to the right, agree=0.616, adj=0.152, (0 split)
## BMI < 25.85 to the right, agree=0.575, adj=0.061, (0 split)
##
## Node number 10: 31 observations
## predicted class=Negative expected loss=0.2258065 P(node) =0.1319149
## class counts: 24 7
## probabilities: 0.774 0.226
##
## Node number 11: 7 observations
## predicted class=Positive expected loss=0.1428571 P(node) =0.02978723
## class counts: 1 6
## probabilities: 0.143 0.857
##
## Node number 14: 40 observations, complexity param=0.02739726
## predicted class=Positive expected loss=0.4 P(node) =0.1702128
## class counts: 16 24
## probabilities: 0.400 0.600
## left son=28 (18 obs) right son=29 (22 obs)
## Primary splits:
## BloodPressure < 77 to the left, improve=2.917172, (0 missing)
## Pregnancies < 3.5 to the right, improve=2.899060, (0 missing)
## Glucose < 130.5 to the right, improve=2.715152, (0 missing)
## BMI < 31.45 to the left, improve=2.540659, (0 missing)
## SkinThickness < 31.5 to the left, improve=1.786895, (0 missing)
## Surrogate splits:
## SkinThickness < 33.5 to the left, agree=0.675, adj=0.278, (0 split)
## Age < 30 to the left, agree=0.675, adj=0.278, (0 split)
## Pregnancies < 7.5 to the left, agree=0.650, adj=0.222, (0 split)
## Insulin < 186 to the right, agree=0.650, adj=0.222, (0 split)
## BMI < 36.25 to the left, agree=0.650, adj=0.222, (0 split)
##
## Node number 15: 33 observations
## predicted class=Positive expected loss=0.1212121 P(node) =0.1404255
## class counts: 4 29
## probabilities: 0.121 0.879
##
## Node number 28: 18 observations
## predicted class=Negative expected loss=0.3888889 P(node) =0.07659574
## class counts: 11 7
## probabilities: 0.611 0.389
##
## Node number 29: 22 observations
## predicted class=Positive expected loss=0.2272727 P(node) =0.09361702
## class counts: 5 17
## probabilities: 0.227 0.773
rpart.plot(rp_model, type = 4, extra = 1, clip.right.labs = FALSE)

rp_pred <- predict(rp_model, test, type = 'class')
cm_rp_orig <- confusionMatrix(table(rp_pred, test$Outcome))
cm_rp_orig
## Confusion Matrix and Statistics
##
##
## rp_pred Negative Positive
## Negative 55 20
## Positive 8 18
##
## Accuracy : 0.7228
## 95% CI : (0.6248, 0.8072)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.02377
##
## Kappa : 0.3699
## Mcnemar's Test P-Value : 0.03764
##
## Sensitivity : 0.8730
## Specificity : 0.4737
## Pos Pred Value : 0.7333
## Neg Pred Value : 0.6923
## Prevalence : 0.6238
## Detection Rate : 0.5446
## Detection Prevalence : 0.7426
## Balanced Accuracy : 0.6734
##
## 'Positive' Class : Negative
##
set.seed(1234)
control <- rpart.control(cp = 0.000, xval = 100, minsplit = 2)
rp_model <- rpart(Outcome~., data = train, control = control)
plotcp(rp_model)

printcp(rp_model)
##
## Classification tree:
## rpart(formula = Outcome ~ ., data = train, control = control)
##
## Variables actually used in tree construction:
## [1] Age BloodPressure
## [3] BMI DiabetesPedigreeFunction
## [5] Glucose Insulin
## [7] Pregnancies SkinThickness
##
## Root node error: 73/235 = 0.31064
##
## n= 235
##
## CP nsplit rel error xerror xstd
## 1 0.2739726 0 1.000000 1.00000 0.097177
## 2 0.1780822 1 0.726027 0.82192 0.091567
## 3 0.0342466 2 0.547945 0.54795 0.078921
## 4 0.0273973 6 0.410959 0.58904 0.081195
## 5 0.0182648 11 0.273973 0.61644 0.082628
## 6 0.0136986 14 0.219178 0.65753 0.084661
## 7 0.0068493 26 0.054795 0.64384 0.083998
## 8 0.0000000 34 0.000000 0.65753 0.084661
set.seed(1234)
selected_tr <- prune(rp_model, cp = rp_model$cptable[which.min(rp_model$cptable[,"xerror"]), "CP"])
rpart.plot(selected_tr, type = 4, extra = 1, clip.right.labs = FALSE)

rp_pred_tune <- predict(selected_tr, test, type = 'class')
cm_rp_tune <- confusionMatrix(table(rp_pred_tune, test$Outcome))
cm_rp_tune
## Confusion Matrix and Statistics
##
##
## rp_pred_tune Negative Positive
## Negative 53 16
## Positive 10 22
##
## Accuracy : 0.7426
## 95% CI : (0.646, 0.8244)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.007895
##
## Kappa : 0.4338
## Mcnemar's Test P-Value : 0.326800
##
## Sensitivity : 0.8413
## Specificity : 0.5789
## Pos Pred Value : 0.7681
## Neg Pred Value : 0.6875
## Prevalence : 0.6238
## Detection Rate : 0.5248
## Detection Prevalence : 0.6832
## Balanced Accuracy : 0.7101
##
## 'Positive' Class : Negative
##
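The prune() call above selects the cp row minimizing cross-validated error (xerror). A common, slightly more conservative alternative is the 1-SE rule: take the simplest tree whose xerror lies within one standard error of the minimum. A sketch using the values printed by printcp() above (in this cptable both criteria happen to land on the same row, cp ≈ 0.0342):

```r
# cptable columns transcribed from the printcp() output above.
cp     <- c(0.2739726, 0.1780822, 0.0342466, 0.0273973,
            0.0182648, 0.0136986, 0.0068493, 0.0000000)
xerror <- c(1.00000, 0.82192, 0.54795, 0.58904,
            0.61644, 0.65753, 0.64384, 0.65753)
xstd   <- c(0.097177, 0.091567, 0.078921, 0.081195,
            0.082628, 0.084661, 0.083998, 0.084661)

threshold <- min(xerror) + xstd[which.min(xerror)]  # 0.54795 + 0.078921
best_1se  <- min(which(xerror <= threshold))        # earliest (simplest) qualifying row
cp[best_1se]                                        # 0.0342466
# In the live session: prune(rp_model, cp = cp[best_1se])
```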
Using the One Rule (OneR) Classification Model –
set.seed(1234)
oneR_model <- OneR(Outcome~., data = train)
oneR_model
## Glucose:
## < 127.5 -> Negative
## < 129.5 -> Positive
## < 143.5 -> Negative
## < 149.0 -> Positive
## < 154.5 -> Negative
## >= 154.5 -> Positive
## (193/235 instances correct)
summary(oneR_model)
##
## === Summary ===
##
## Correctly Classified Instances 193 82.1277 %
## Incorrectly Classified Instances 42 17.8723 %
## Kappa statistic 0.5524
## Mean absolute error 0.1787
## Root mean squared error 0.4228
## Relative absolute error 41.6712 %
## Root relative squared error 91.356 %
## Total Number of Instances 235
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 150 12 | a = Negative
## 30 43 | b = Positive
oneR_pred <- predict(oneR_model, test, type = 'class')
cm_oneR_orig <- confusionMatrix(oneR_pred, test$Outcome)
cm_oneR_orig
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 53 19
## Positive 10 19
##
## Accuracy : 0.7129
## 95% CI : (0.6143, 0.7985)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.03858
##
## Kappa : 0.3581
## Mcnemar's Test P-Value : 0.13739
##
## Sensitivity : 0.8413
## Specificity : 0.5000
## Pos Pred Value : 0.7361
## Neg Pred Value : 0.6552
## Prevalence : 0.6238
## Detection Rate : 0.5248
## Detection Prevalence : 0.7129
## Balanced Accuracy : 0.6706
##
## 'Positive' Class : Negative
##
Using the JRip Rule Learning Model –
set.seed(1234)
jrip_model <- JRip(Outcome~., data = train)
jrip_model
## JRIP rules:
## ===========
##
## (Glucose >= 128) and (Age >= 25) => Outcome=Positive (73.0/20.0)
## (Age >= 41) and (DiabetesPedigreeFunction >= 0.412) => Outcome=Positive (9.0/2.0)
## => Outcome=Negative (153.0/13.0)
##
## Number of Rules : 3
summary(jrip_model)
##
## === Summary ===
##
## Correctly Classified Instances 200 85.1064 %
## Incorrectly Classified Instances 35 14.8936 %
## Kappa statistic 0.6636
## Mean absolute error 0.2381
## Root mean squared error 0.345
## Relative absolute error 55.5051 %
## Root relative squared error 74.5539 %
## Total Number of Instances 235
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 140 22 | a = Negative
## 13 60 | b = Positive
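The (covered/errors) counts attached to each JRIP rule reconcile exactly with the training summary above:

```r
# Per-rule instance counts from the JRIP rule list (including the default rule).
covered <- c(73, 9, 153)   # instances matched by each rule
errors  <- c(20, 2, 13)    # misclassified among those
sum(covered)                                  # 235 training cases
sum(errors)                                   # 35 incorrectly classified
(sum(covered) - sum(errors)) / sum(covered)   # 0.8510638, i.e. 85.1064 %
```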
jrip_pred <- predict(jrip_model, test, type = 'class')
cm_jrip_orig <- confusionMatrix(jrip_pred, test$Outcome)
cm_jrip_orig
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 48 15
## Positive 15 23
##
## Accuracy : 0.703
## 95% CI : (0.6039, 0.7898)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.06002
##
## Kappa : 0.3672
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.7619
## Specificity : 0.6053
## Pos Pred Value : 0.7619
## Neg Pred Value : 0.6053
## Prevalence : 0.6238
## Detection Rate : 0.4752
## Detection Prevalence : 0.6238
## Balanced Accuracy : 0.6836
##
## 'Positive' Class : Negative
##
Using the Naive Bayes Model (with and without Laplace Smoothing; Laplace Parameter = 50) –
set.seed(1234)
nb_model <- naiveBayes(train, train$Outcome)
nb_model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = train, y = train$Outcome)
##
## A-priori probabilities:
## train$Outcome
## Negative Positive
## 0.6893617 0.3106383
##
## Conditional probabilities:
## Pregnancies
## train$Outcome [,1] [,2]
## Negative 3.067901 2.482552
## Positive 5.479452 3.869789
##
## Glucose
## train$Outcome [,1] [,2]
## Negative 110.3333 25.43705
## Positive 144.3288 28.50539
##
## BloodPressure
## train$Outcome [,1] [,2]
## Negative 67.35802 11.68405
## Positive 74.19178 12.52182
##
## SkinThickness
## train$Outcome [,1] [,2]
## Negative 26.57407 9.945695
## Positive 32.75342 9.352333
##
## Insulin
## train$Outcome [,1] [,2]
## Negative 122.3025 87.12696
## Positive 200.9041 120.79450
##
## BMI
## train$Outcome [,1] [,2]
## Negative 30.86605 6.205399
## Positive 34.83425 5.665348
##
## DiabetesPedigreeFunction
## train$Outcome [,1] [,2]
## Negative 0.4670185 0.2952471
## Positive 0.6424521 0.3718617
##
## Age
## train$Outcome [,1] [,2]
## Negative 28.46914 9.391534
## Positive 37.71233 10.123514
##
## Outcome
## train$Outcome Negative Positive
## Negative 1 0
## Positive 0 1
nb_pred <- predict(nb_model, test)
cm_nb_orig <- confusionMatrix(table(nb_pred, test$Outcome))
cm_nb_orig
## Confusion Matrix and Statistics
##
##
## nb_pred Negative Positive
## Negative 61 1
## Positive 2 37
##
## Accuracy : 0.9703
## 95% CI : (0.9156, 0.9938)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.937
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9683
## Specificity : 0.9737
## Pos Pred Value : 0.9839
## Neg Pred Value : 0.9487
## Prevalence : 0.6238
## Detection Rate : 0.6040
## Detection Prevalence : 0.6139
## Balanced Accuracy : 0.9710
##
## 'Positive' Class : Negative
##
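One caveat worth flagging: naiveBayes(train, train$Outcome) passes the entire training frame, Outcome column included, as the predictor matrix. This is visible in the conditional-probability tables above, which end with an Outcome table in which each class predicts itself with probability 1 — so the near-perfect 97% test accuracy reflects target leakage rather than model quality. A minimal self-contained demo of the effect (the toy data and nb_fix call below are illustrative, not from the original analysis):

```r
library(e1071)

# Synthetic stand-in data: passing the whole data frame as x leaves the
# target y among the predictors, so the model can "predict" y from itself.
set.seed(1)
toy <- data.frame(x = rnorm(100),
                  y = factor(sample(c("Negative", "Positive"), 100, replace = TRUE)))
leaky <- naiveBayes(toy, toy$y)       # y is a column of 'toy' -> leakage
clean <- naiveBayes(toy["x"], toy$y)  # predictors only
mean(predict(leaky, toy) == toy$y)        # essentially perfect, but meaningless
mean(predict(clean, toy["x"]) == toy$y)   # roughly chance level, as expected
# The leak-free call for the analysis above would be (Outcome is column 9):
# nb_fix <- naiveBayes(train[ , -9], train$Outcome); predict(nb_fix, test[ , -9])
```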
set.seed(1234)
nb_lap_model <- naiveBayes(train, train$Outcome, laplace = 50)
nb_lap_model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = train, y = train$Outcome, laplace = 50)
##
## A-priori probabilities:
## train$Outcome
## Negative Positive
## 0.6893617 0.3106383
##
## Conditional probabilities:
## Pregnancies
## train$Outcome [,1] [,2]
## Negative 3.067901 2.482552
## Positive 5.479452 3.869789
##
## Glucose
## train$Outcome [,1] [,2]
## Negative 110.3333 25.43705
## Positive 144.3288 28.50539
##
## BloodPressure
## train$Outcome [,1] [,2]
## Negative 67.35802 11.68405
## Positive 74.19178 12.52182
##
## SkinThickness
## train$Outcome [,1] [,2]
## Negative 26.57407 9.945695
## Positive 32.75342 9.352333
##
## Insulin
## train$Outcome [,1] [,2]
## Negative 122.3025 87.12696
## Positive 200.9041 120.79450
##
## BMI
## train$Outcome [,1] [,2]
## Negative 30.86605 6.205399
## Positive 34.83425 5.665348
##
## DiabetesPedigreeFunction
## train$Outcome [,1] [,2]
## Negative 0.4670185 0.2952471
## Positive 0.6424521 0.3718617
##
## Age
## train$Outcome [,1] [,2]
## Negative 28.46914 9.391534
## Positive 37.71233 10.123514
##
## Outcome
## train$Outcome Negative Positive
## Negative 0.8091603 0.1908397
## Positive 0.2890173 0.7109827
nb_lap_pred <- predict(nb_lap_model, test)
cm_nb_lapl<- confusionMatrix(table(nb_lap_pred, test$Outcome))
cm_nb_lapl
## Confusion Matrix and Statistics
##
##
## nb_lap_pred Negative Positive
## Negative 53 11
## Positive 10 27
##
## Accuracy : 0.7921
## 95% CI : (0.6999, 0.8664)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.0002133
##
## Kappa : 0.5547
## Mcnemar's Test P-Value : 1.0000000
##
## Sensitivity : 0.8413
## Specificity : 0.7105
## Pos Pred Value : 0.8281
## Neg Pred Value : 0.7297
## Prevalence : 0.6238
## Detection Rate : 0.5248
## Detection Prevalence : 0.6337
## Balanced Accuracy : 0.7759
##
## 'Positive' Class : Negative
##
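Note that the Gaussian tables above are identical to the unsmoothed model's: Laplace smoothing only affects categorical predictors, which here means only the (leaked) Outcome column. Its smoothed entries can be verified by hand as (count + L) / (n + L × k), with L = 50 and k = 2 outcome levels:

```r
# Verify the smoothed Outcome table above: (count + L) / (n + L * k).
L <- 50   # laplace parameter
k <- 2    # number of outcome levels
(162 + L) / (162 + L * k)   # P(Outcome = Negative | Negative): 212/262 = 0.8091603
(0   + L) / (162 + L * k)   # P(Outcome = Positive | Negative):  50/262 = 0.1908397
(73  + L) / (73  + L * k)   # P(Outcome = Positive | Positive): 123/173 = 0.7109827
```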
Using the Linear Discriminant Analysis (LDA) Model –
set.seed(1234)
lda_model <- lda(data = train, Outcome~.)
lda_model
## Call:
## lda(Outcome ~ ., data = train)
##
## Prior probabilities of groups:
## Negative Positive
## 0.6893617 0.3106383
##
## Group means:
## Pregnancies Glucose BloodPressure SkinThickness Insulin
## Negative 3.067901 110.3333 67.35802 26.57407 122.3025
## Positive 5.479452 144.3288 74.19178 32.75342 200.9041
## BMI DiabetesPedigreeFunction Age
## Negative 30.86605 0.4670185 28.46914
## Positive 34.83425 0.6424521 37.71233
##
## Coefficients of linear discriminants:
## LD1
## Pregnancies 0.072531222
## Glucose 0.023410050
## BloodPressure 0.009540306
## SkinThickness 0.006584657
## Insulin 0.000944487
## BMI 0.038839475
## DiabetesPedigreeFunction 1.093580091
## Age 0.024590227
plot(lda_model)
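The LD1 coefficients above define a single linear score per observation, which predict() combines with the class priors. A self-contained sketch on the built-in iris data, showing that the discriminant scores returned by predict() are just the centered predictors times the model's scaling (coefficient) matrix:

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
# Multiply centered predictors by the coefficient (scaling) matrix by hand.
centered <- scale(as.matrix(iris[ , 1:4]), scale = FALSE)
manual   <- centered %*% fit$scaling
# This differs from predict(fit)$x by at most a constant shift, so the
# correlation with the first discriminant is exactly 1.
cor(manual[ , 1], predict(fit)$x[ , 1])
```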

lda_pred <- predict(lda_model, test)
cm_lda_orig <- confusionMatrix(table(lda_pred$class, test$Outcome))
cm_lda_orig
## Confusion Matrix and Statistics
##
##
## Negative Positive
## Negative 57 19
## Positive 6 19
##
## Accuracy : 0.7525
## 95% CI : (0.6567, 0.833)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.004243
##
## Kappa : 0.4342
## Mcnemar's Test P-Value : 0.016395
##
## Sensitivity : 0.9048
## Specificity : 0.5000
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.7600
## Prevalence : 0.6238
## Detection Rate : 0.5644
## Detection Prevalence : 0.7525
## Balanced Accuracy : 0.7024
##
## 'Positive' Class : Negative
##
Using the Random Forest Classification Model –
set.seed(1234)
rf_model <- randomForest(Outcome~., data = train, ntree = 500, proximity = TRUE, importance = TRUE)
rf_model
##
## Call:
## randomForest(formula = Outcome ~ ., data = train, ntree = 500, proximity = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20%
## Confusion matrix:
## Negative Positive class.error
## Negative 142 20 0.1234568
## Positive 27 46 0.3698630
varImpPlot(rf_model, cex=0.5)

plot(rf_model, log = "x", main="Random Forest (Error Rate vs. Number of Trees)")

rf_pred <- predict(rf_model, test)
cm_rf_orig <- confusionMatrix(table(rf_pred, test$Outcome))
cm_rf_orig
## Confusion Matrix and Statistics
##
##
## rf_pred Negative Positive
## Negative 56 18
## Positive 7 20
##
## Accuracy : 0.7525
## 95% CI : (0.6567, 0.833)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.004243
##
## Kappa : 0.4405
## Mcnemar's Test P-Value : 0.045500
##
## Sensitivity : 0.8889
## Specificity : 0.5263
## Pos Pred Value : 0.7568
## Neg Pred Value : 0.7407
## Prevalence : 0.6238
## Detection Rate : 0.5545
## Detection Prevalence : 0.7327
## Balanced Accuracy : 0.7076
##
## 'Positive' Class : Negative
##
set.seed(1234)
rf_new_model <- randomForest(Outcome~., data = train, ntree = 2000, proximity = TRUE, importance = TRUE)
rf_new_model
##
## Call:
## randomForest(formula = Outcome ~ ., data = train, ntree = 2000, proximity = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 2000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.43%
## Confusion matrix:
## Negative Positive class.error
## Negative 143 19 0.1172840
## Positive 29 44 0.3972603
varImpPlot(rf_new_model, cex=0.5)

plot(rf_new_model, log = "x", main="Random Forest (Error Rate vs. Number of Trees)")

rf_new_pred <- predict(rf_new_model, test)
cm_rf_tune <- confusionMatrix(table(rf_new_pred, test$Outcome))
cm_rf_tune
## Confusion Matrix and Statistics
##
##
## rf_new_pred Negative Positive
## Negative 56 17
## Positive 7 21
##
## Accuracy : 0.7624
## 95% CI : (0.6674, 0.8414)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.002172
##
## Kappa : 0.4658
## Mcnemar's Test P-Value : 0.066193
##
## Sensitivity : 0.8889
## Specificity : 0.5526
## Pos Pred Value : 0.7671
## Neg Pred Value : 0.7500
## Prevalence : 0.6238
## Detection Rate : 0.5545
## Detection Prevalence : 0.7228
## Balanced Accuracy : 0.7208
##
## 'Positive' Class : Negative
##
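Increasing ntree from 500 to 2000 changed the OOB error only marginally (20% vs. 20.43%), which is typical: random forests are fairly insensitive to ntree beyond a few hundred trees. The mtry parameter (variables tried at each split, 2 in both fits above) usually matters more. A hedged sketch of tuning it by OOB error with randomForest's tuneRF helper, run here on synthetic stand-in data of the same shape (for the real analysis the call would use train[ , -9] and train$Outcome):

```r
library(randomForest)

# Synthetic stand-in: 235 cases, 8 numeric predictors, binary outcome.
set.seed(1234)
X <- data.frame(matrix(rnorm(235 * 8), ncol = 8))
y <- factor(ifelse(X[[1]] + X[[2]] + rnorm(235) > 0, "Positive", "Negative"))

# Search mtry outward from the default (floor(sqrt(8)) = 2) by OOB error.
tuned <- tuneRF(X, y, ntreeTry = 500, stepFactor = 1.5,
                improve = 0.01, trace = FALSE, plot = FALSE)
tuned   # matrix of mtry values and their OOB errors; pick the minimum
```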
Using the Conditional Inference Tree (C-Tree) Model –
set.seed(1234)
ctree_model <- ctree(Outcome~., data = train, controls=ctree_control(maxdepth=5))
ctree_model
##
## Conditional inference tree with 5 terminal nodes
##
## Response: Outcome
## Inputs: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age
## Number of observations: 235
##
## 1) Glucose <= 127; criterion = 1, statistic = 61.625
## 2) Age <= 43; criterion = 0.999, statistic = 15.76
## 3) BMI <= 40.5; criterion = 0.991, statistic = 10.702
## 4) DiabetesPedigreeFunction <= 0.673; criterion = 0.983, statistic = 9.42
## 5)* weights = 96
## 4) DiabetesPedigreeFunction > 0.673
## 6)* weights = 28
## 3) BMI > 40.5
## 7)* weights = 9
## 2) Age > 43
## 8)* weights = 12
## 1) Glucose > 127
## 9)* weights = 90
plot(ctree_model)

ctree_pred <- predict(ctree_model, test)
cm_ctree_orig <- confusionMatrix(table(ctree_pred, test$Outcome))
cm_ctree_orig
## Confusion Matrix and Statistics
##
##
## ctree_pred Negative Positive
## Negative 49 13
## Positive 14 25
##
## Accuracy : 0.7327
## 95% CI : (0.6354, 0.8159)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.01401
##
## Kappa : 0.4334
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.7778
## Specificity : 0.6579
## Pos Pred Value : 0.7903
## Neg Pred Value : 0.6410
## Prevalence : 0.6238
## Detection Rate : 0.4851
## Detection Prevalence : 0.6139
## Balanced Accuracy : 0.7178
##
## 'Positive' Class : Negative
##
Using the K-Means Clustering Technique (Best Result: K = 7) –
set.seed(1234)
df_z <- as.data.frame(lapply(df[,-9], scale))
km_model <- kmeans(df_z, 3)
km_model
## K-means clustering with 3 clusters of sizes 148, 110, 78
##
## Cluster means:
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 -0.4399046 -0.5044645 -0.4307171 -0.7562568 -0.4395881 -0.6866542
## 2 -0.4291739 0.2404463 0.2037941 0.7627544 0.2753585 0.7769629
## 3 1.4399360 0.6180980 0.5298561 0.3592695 0.4457642 0.2071655
## DiabetesPedigreeFunction Age
## 1 -0.12608667 -0.5341434
## 2 0.13348950 -0.2920598
## 3 0.05098695 1.4253819
##
## Clustering vector:
## [1] 1 1 3 3 3 2 2 2 3 3 1 3 2 1 1 3 2 3 1 1 1 3 3 3 1 1 1 1 2 2 1 2 1 3 1
## [36] 2 1 3 1 1 2 1 1 1 1 2 3 1 3 1 1 2 2 2 2 1 2 1 1 2 1 2 2 2 3 2 1 1 1 3
## [71] 3 1 1 2 2 1 3 3 2 3 2 3 2 1 2 2 1 3 3 1 3 3 2 1 3 1 1 2 3 1 1 3 1 1 2
## [106] 3 1 3 1 3 1 1 1 2 2 1 3 3 3 2 2 1 2 2 2 2 2 3 2 2 2 3 2 1 1 1 1 2 1 3
## [141] 1 2 2 1 1 1 3 1 1 3 1 1 1 2 3 2 2 2 1 1 2 2 2 2 3 2 1 1 1 1 1 3 1 1 1
## [176] 1 1 1 2 2 2 2 2 1 2 1 2 1 3 2 2 2 1 1 1 1 1 1 1 1 3 3 3 3 2 2 1 3 2 1
## [211] 2 1 1 1 3 3 1 1 1 1 1 1 3 3 1 2 1 1 1 2 1 2 3 2 2 1 3 3 1 2 1 1 1 3 2
## [246] 1 1 2 3 2 1 1 2 2 1 3 3 2 1 2 1 1 3 2 1 1 1 2 3 3 1 2 2 1 1 2 1 1 2 1
## [281] 3 1 1 2 1 2 1 2 2 3 3 2 3 3 3 3 2 1 1 2 1 2 2 3 3 1 1 2 1 1 2 1 1 3 2
## [316] 2 2 2 3 2 1 2 1 1 3 1 1 3 3 2 2 2 2 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 566.2729 681.6656 528.4083
## (between_SS / total_SS = 33.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
sil3 <- silhouette(km_model$cluster, dist(df_z))
summary(sil3)
## Silhouette of 336 units in 3 clusters from silhouette.default(x = km_model$cluster, dist = dist(df_z)) :
## Cluster sizes and average silhouette widths:
## 148 110 78
## 0.2952418 0.0968107 0.1579464
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.11707 0.09749 0.19106 0.19841 0.29783 0.46749
plot(sil3, col=1:length(km_model$size), border=NA)

km_model$centers
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 -0.4399046 -0.5044645 -0.4307171 -0.7562568 -0.4395881 -0.6866542
## 2 -0.4291739 0.2404463 0.2037941 0.7627544 0.2753585 0.7769629
## 3 1.4399360 0.6180980 0.5298561 0.3592695 0.4457642 0.2071655
## DiabetesPedigreeFunction Age
## 1 -0.12608667 -0.5341434
## 2 0.13348950 -0.2920598
## 3 0.05098695 1.4253819
par(mfrow=c(1, 1), mar=c(4, 4, 4, 2))
myColors <- c("darkblue", "red", "green", "brown", "pink", "purple", "yellow", "orange")
barplot(t(km_model$centers), beside = TRUE, xlab="cluster", ylab="value", col = myColors)
legend("top", ncol=2, legend = c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"), fill = myColors)

df_km <- df
df_km$clusters <- km_model$cluster
ggplot(df_km, aes(Glucose, BloodPressure)) +
  ggtitle("Scatterplot: Glucose vs BloodPressure") +
  geom_point(aes(colour = factor(clusters), shape = factor(clusters), stroke = 8), alpha = 1) +
  theme_bw(base_size = 25) +
  geom_text(aes(label = ifelse(clusters %in% 1, as.character(clusters), ''), hjust = 2, vjust = 2, colour = factor(clusters))) +
  geom_text(aes(label = ifelse(clusters %in% 2, as.character(clusters), ''), hjust = -2, vjust = 2, colour = factor(clusters))) +
  geom_text(aes(label = ifelse(clusters %in% 3, as.character(clusters), ''), hjust = 2, vjust = -1, colour = factor(clusters))) +
  guides(colour = guide_legend(override.aes = list(size = 8))) +
  theme(legend.position = "top")

# k-means++ initialization; rowMins() comes from the matrixStats package
kpp_init = function(dat, K) {
  x = as.matrix(dat)
  n = nrow(x)
  # Randomly choose a first center
  centers = matrix(NA, nrow=K, ncol=ncol(x))
  set.seed(123)
  centers[1,] = as.matrix(x[sample(1:n, 1),])
  for (k in 2:K) {
    # Calculate dist^2 to closest center for each point
    dists = matrix(NA, nrow=n, ncol=k-1)
    for (j in 1:(k-1)) {
      temp = sweep(x, 2, centers[j,], '-')
      dists[,j] = rowSums(temp^2)
    }
    dists = rowMins(dists)
    # Draw next center with probability proportional to dist^2
    cumdists = cumsum(dists)
    prop = runif(1, min=0, max=cumdists[n])
    centers[k,] = as.matrix(x[min(which(cumdists > prop)),])
  }
  return(centers)
}
kmp_model <- kmeans(df_z, kpp_init(df_z, 3), iter.max=100, algorithm='Lloyd')
kmp_model
## K-means clustering with 3 clusters of sizes 113, 145, 78
##
## Cluster means:
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 -0.4193360 0.1838024 0.2100280 0.7312880 0.1782657 0.7516222
## 2 -0.4390313 -0.5118013 -0.4475874 -0.7719095 -0.4446160 -0.6982706
## 3 1.4236475 0.6851477 0.5277822 0.3755299 0.5682731 0.2091786
## DiabetesPedigreeFunction Age
## 1 0.06213071 -0.2737365
## 2 -0.11777023 -0.5428807
## 3 0.12892195 1.4057683
##
## Clustering vector:
## [1] 2 2 3 3 3 1 1 1 3 3 2 3 1 2 2 3 1 3 2 2 2 3 3 3 2 2 1 2 1 1 2 1 2 3 2
## [36] 1 2 3 2 2 1 2 2 2 2 1 3 2 3 2 1 1 1 1 1 2 1 2 2 1 2 1 1 1 3 1 2 2 2 3
## [71] 3 2 2 1 1 2 3 3 1 3 1 3 1 2 1 1 2 3 3 2 3 3 1 2 3 2 2 3 3 2 2 3 2 2 1
## [106] 3 2 3 2 3 2 2 2 1 1 2 3 3 3 3 1 2 1 1 1 1 1 3 1 1 1 3 1 2 2 2 2 1 2 3
## [141] 2 1 1 2 2 2 3 2 2 3 2 2 2 1 3 1 1 1 2 2 1 1 1 1 3 1 2 2 2 2 2 3 2 1 2
## [176] 2 2 2 1 1 1 1 1 2 1 2 1 2 3 1 1 1 2 2 2 2 2 2 2 2 3 3 3 1 1 1 2 3 1 2
## [211] 1 2 2 2 3 3 2 2 2 2 2 2 3 3 2 1 2 2 2 1 2 1 3 1 1 2 3 3 2 1 2 2 2 3 1
## [246] 2 2 1 3 1 2 2 1 1 2 3 3 1 2 1 2 2 3 1 2 2 2 1 3 3 2 1 1 2 2 1 2 2 1 2
## [281] 3 2 2 1 2 1 2 1 1 3 3 1 3 3 3 3 1 2 2 1 2 1 1 3 3 2 2 1 2 2 1 2 2 3 1
## [316] 1 1 1 1 1 2 1 2 2 3 2 2 3 3 1 1 1 1 2 3 2
##
## Within cluster sum of squares by cluster:
## [1] 638.5022 554.0026 585.1043
## (between_SS / total_SS = 33.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
kmp_model$centers
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 -0.4193360 0.1838024 0.2100280 0.7312880 0.1782657 0.7516222
## 2 -0.4390313 -0.5118013 -0.4475874 -0.7719095 -0.4446160 -0.6982706
## 3 1.4236475 0.6851477 0.5277822 0.3755299 0.5682731 0.2091786
## DiabetesPedigreeFunction Age
## 1 0.06213071 -0.2737365
## 2 -0.11777023 -0.5428807
## 3 0.12892195 1.4057683
sil3 <- silhouette(kmp_model$cluster, dist(df_z))
summary(sil3)
## Silhouette of 336 units in 3 clusters from silhouette.default(x = kmp_model$cluster, dist = dist(df_z)) :
## Cluster sizes and average silhouette widths:
## 113 145 78
## 0.1127129 0.2835945 0.1335516
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.08993 0.09269 0.18783 0.19129 0.28566 0.45646
plot(sil3, col=1:length(kmp_model$size), border=NA)
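The average widths reported above come from the standard silhouette definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other members of its own cluster and b(i) the mean distance to the nearest other cluster. A hand computation on four illustrative 1-D points matches `cluster::silhouette()`:

```r
library(cluster)

x  <- c(1, 2, 8, 9)          # two obvious 1-D clusters
cl <- c(1, 1, 2, 2)
s  <- silhouette(cl, dist(x))
# Manual check for the first point: a = |1-2| = 1, b = mean(|1-8|, |1-9|) = 7.5
a <- 1; b <- 7.5
all.equal(unname(s[1, "sil_width"]), (b - a) / max(a, b))  # TRUE
```

Widths near 1 indicate a point sits deep inside its own cluster; widths near 0 (or negative, as for some observations above) indicate it lies on the boundary between clusters.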

n_rows <- 21
mat <- matrix(0, nrow = n_rows)
for (i in 2:n_rows) {
  set.seed(1234)
  kmp_model <- kmeans(df_z, kpp_init(df_z, i), iter.max=100, algorithm='Lloyd')
  sil <- silhouette(kmp_model$cluster, dist(df_z))
  mat[i] <- mean(as.matrix(sil)[,3])
}
colnames(mat) <- c("Avg_Silhouette_Value")
mat
## Avg_Silhouette_Value
## [1,] 0.0000000
## [2,] 0.2323945
## [3,] 0.1912940
## [4,] 0.1479250
## [5,] 0.1344793
## [6,] 0.1230041
## [7,] 0.1398260
## [8,] 0.1371855
## [9,] 0.1380133
## [10,] 0.1380535
## [11,] 0.1231905
## [12,] 0.1307526
## [13,] 0.1276510
## [14,] 0.1270141
## [15,] 0.1262650
## [16,] 0.1226770
## [17,] 0.1185223
## [18,] 0.1159048
## [19,] 0.1172251
## [20,] 0.1165688
## [21,] 0.1156074
ggplot(data.frame(k=2:n_rows,sil=mat[2:n_rows]),aes(x=k,y=sil)) + geom_line() + scale_x_continuous(breaks = 2:n_rows)

k <- 2
set.seed(1234)
kmp2_model <- kmeans(df_z, kpp_init(df_z, k), iter.max=200, algorithm="MacQueen")
sil2 <- silhouette(kmp2_model$cluster, dist(df_z))
summary(sil2)
## Silhouette of 336 units in 2 clusters from silhouette.default(x = kmp2_model$cluster, dist = dist(df_z)) :
## Cluster sizes and average silhouette widths:
## 142 194
## 0.09200984 0.33515034
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1282 0.1250 0.2309 0.2324 0.3793 0.5027
plot(sil2, col=1:length(kmp2_model$size), border=NA)

k <- 4
set.seed(1234)
kmp4_model <- kmeans(df_z, kpp_init(df_z, k), iter.max=200, algorithm="MacQueen")
sil4 <- silhouette(kmp4_model$cluster, dist(df_z))
summary(sil4)
## Silhouette of 336 units in 4 clusters from silhouette.default(x = kmp4_model$cluster, dist = dist(df_z)) :
## Cluster sizes and average silhouette widths:
## 103 111 50 72
## 0.14745426 0.20923738 0.02783948 0.13249982
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.15367 0.06759 0.15000 0.14686 0.22068 0.39826
plot(sil4, col=1:length(kmp4_model$size), border=NA)

k <- 7
set.seed(1234)
kmp7_model <- kmeans(df_z, kpp_init(df_z, k), iter.max=200, algorithm="MacQueen")
sil7 <- silhouette(kmp7_model$cluster, dist(df_z))
summary(sil7)
## Silhouette of 336 units in 7 clusters from silhouette.default(x = kmp7_model$cluster, dist = dist(df_z)) :
## Cluster sizes and average silhouette widths:
## 45 41 35 59 20 50
## 0.15087837 0.13186039 0.04067210 0.11255464 0.07534761 0.16593818
## 86
## 0.18901895
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.12311 0.05225 0.12866 0.13786 0.22776 0.41285
plot(sil7, col=1:length(kmp7_model$size), border=NA)

k <- 8
set.seed(1234)
kmp8_model <- kmeans(df_z, kpp_init(df_z, k), iter.max=200, algorithm="MacQueen")
sil8 <- silhouette(kmp8_model$cluster, dist(df_z))
summary(sil8)
## Silhouette of 336 units in 8 clusters from silhouette.default(x = kmp8_model$cluster, dist = dist(df_z)) :
## Cluster sizes and average silhouette widths:
## 55 42 25 60 19 45
## 0.12746487 0.12542888 0.08668730 0.10523802 0.05696327 0.18514059
## 34 56
## 0.12806721 0.23459332
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.13982 0.04988 0.13604 0.14186 0.22597 0.43003
plot(sil8, col=1:length(kmp8_model$size), border=NA)

k <- 11
set.seed(1234)
kmp11_model <- kmeans(df_z, kpp_init(df_z, k), iter.max=200, algorithm="MacQueen")
sil11 <- silhouette(kmp11_model$cluster, dist(df_z))
summary(sil11)
## Silhouette of 336 units in 11 clusters from silhouette.default(x = kmp11_model$cluster, dist = dist(df_z)) :
## Cluster sizes and average silhouette widths:
## 42 60 26 25 11 29
## 0.17941366 0.23337868 0.05755744 0.09833683 -0.02482009 0.12000844
## 23 15 26 38 41
## 0.12526127 0.11506333 0.11742627 0.11311238 0.13706808
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.19813 0.06575 0.13136 0.13773 0.22522 0.44301
plot(sil11, col=1:length(kmp11_model$size), border=NA)

cat("\nFrom the above results, k = 2 yields the highest average silhouette width (about 0.23), making it the preferred choice for this data.")
##
## From the above results, k = 2 yields the highest average silhouette width (about 0.23), making it the preferred choice for this data.
Using the Generalized Linear Model for Performing Logistic Regression –
glm_model <- glm(Outcome~., family = binomial(link = 'logit'), data = train)
summary(glm_model)
##
## Call:
## glm(formula = Outcome ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7014 -0.5429 -0.2738 0.5669 2.9094
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.721212 1.945110 -6.540 6.15e-11 ***
## Pregnancies 0.119194 0.075677 1.575 0.115249
## Glucose 0.037403 0.007707 4.853 1.21e-06 ***
## BloodPressure 0.015244 0.017823 0.855 0.392405
## SkinThickness 0.006066 0.023429 0.259 0.795700
## Insulin 0.001375 0.001910 0.720 0.471642
## BMI 0.090099 0.042611 2.114 0.034478 *
## DiabetesPedigreeFunction 2.017052 0.595117 3.389 0.000701 ***
## Age 0.032996 0.024002 1.375 0.169215
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 291.22 on 234 degrees of freedom
## Residual deviance: 181.56 on 226 degrees of freedom
## AIC: 199.56
##
## Number of Fisher Scoring iterations: 5
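Each coefficient in the summary above is a log-odds increment, so exponentiating it gives an odds ratio per unit increase in that predictor. For Glucose, for instance (coefficient copied from the summary):

```r
# Odds ratio for a one-unit increase in plasma glucose
exp(0.037403)  # ~1.0381: each additional unit multiplies the odds by about 3.8%
```

The same transformation applies to the other significant predictors, BMI and DiabetesPedigreeFunction.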
glm_pred <- predict(glm_model, test, type = "response")
table(glm_pred>0.5, test$Outcome)
##
## Negative Positive
## FALSE 57 19
## TRUE 6 19
glm_tab <- table(glm_pred > 0.5, test$Outcome)
glm_acc <- (glm_tab[1] + glm_tab[4]) / sum(glm_tab)
glm_speci <- glm_tab[4] / (glm_tab[2] + glm_tab[4])
glm_sensi <- glm_tab[1] / (glm_tab[1] + glm_tab[3])
cat("\nAccuracy:",glm_acc)
##
## Accuracy: 0.7524752
cat("\nSpecificity:",glm_speci)
##
## Specificity: 0.76
cat("\nSensitivity:",glm_sensi)
##
## Sensitivity: 0.75
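These three figures follow directly from the 2x2 table printed above: accuracy pools the diagonal, while the reported sensitivity and specificity are row-wise proportions of the FALSE and TRUE prediction rows. A quick arithmetic check (cell values copied from that table):

```r
tab <- matrix(c(57, 6, 19, 19), nrow = 2,
              dimnames = list(c("FALSE", "TRUE"), c("Negative", "Positive")))
acc   <- (tab[1, 1] + tab[2, 2]) / sum(tab)  # (57 + 19) / 101
sensi <- tab[1, 1] / sum(tab[1, ])           # 57 / 76
speci <- tab[2, 2] / sum(tab[2, ])           # 19 / 25
round(c(acc, sensi, speci), 4)               # 0.7525 0.7500 0.7600
```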
Using the Gradient Boosting Method for Modeling –
gbm_model <- gbm(Outcome~., data = train, distribution = "gaussian",
                 n.trees = 10000, shrinkage = 0.01, interaction.depth = 4,
                 bag.fraction = 0.5, train.fraction = 0.5, n.minobsinnode = 10,
                 cv.folds = 3, keep.data = TRUE, verbose = FALSE, n.cores = 1)
best_iteration <- gbm.perf(gbm_model, method = "cv", plot.it = FALSE)
fit_control <- trainControl(method = "cv", number = 5, returnResamp = "all")
gbm_final_model <- train(Outcome~., data = train, method = "gbm",
                         distribution = "bernoulli", trControl = fit_control,
                         verbose = FALSE,
                         tuneGrid = data.frame(.n.trees = best_iteration,
                                               .shrinkage = 0.01,
                                               .interaction.depth = 1,
                                               .n.minobsinnode = 1))
gbm_pred <- predict(gbm_final_model, test)
cm_gbm_orig <- confusionMatrix(table(gbm_pred, test$Outcome))
cm_gbm_orig
## Confusion Matrix and Statistics
##
##
## gbm_pred Negative Positive
## Negative 55 18
## Positive 8 20
##
## Accuracy : 0.7426
## 95% CI : (0.646, 0.8244)
## No Information Rate : 0.6238
## P-Value [Acc > NIR] : 0.007895
##
## Kappa : 0.4213
## Mcnemar's Test P-Value : 0.077556
##
## Sensitivity : 0.8730
## Specificity : 0.5263
## Pos Pred Value : 0.7534
## Neg Pred Value : 0.7143
## Prevalence : 0.6238
## Detection Rate : 0.5446
## Detection Prevalence : 0.7228
## Balanced Accuracy : 0.6997
##
## 'Positive' Class : Negative
##
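The two-step pattern above (let gbm's internal cross-validation pick the number of trees, then hand that count to caret as a fixed `n.trees`) hinges on `gbm.perf()`. A minimal sketch of the selection step on synthetic data (the toy data and hyperparameters below are illustrative only):

```r
library(gbm)

# Illustrative binary-outcome data: y depends on x1 plus noise
set.seed(1234)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- as.numeric(toy$x1 + rnorm(200) > 0)

fit <- gbm(y ~ ., data = toy, distribution = "bernoulli",
           n.trees = 500, shrinkage = 0.05, cv.folds = 3, verbose = FALSE)
best <- gbm.perf(fit, method = "cv", plot.it = FALSE)
best  # iteration count minimizing the cross-validated loss, reused downstream
```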
Using a Neural Network Model with Varying Numbers of Hidden Nodes and Layers –
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
nn_train <- as.data.frame(lapply(train[,-9], normalize))
nn_train$Outcome <- ifelse(train$Outcome == "Positive", 1, 0)
nn_test <- as.data.frame(lapply(test[,-9], normalize))
nn_test$Outcome <- ifelse(test$Outcome == "Positive", 1, 0)
nn_model <- neuralnet(Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
                        Insulin + BMI + DiabetesPedigreeFunction + Age,
                      data = nn_train, hidden = 1, stepmax = 1e6)
plot(nn_model, rep = "best")
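`normalize()` above is a standard min-max rescaling to the [0, 1] interval, applied column-wise before training so all inputs share a comparable scale. A tiny check on illustrative values:

```r
# One-line restatement of normalize() from the chunk above
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
round(normalize(c(50, 100, 150, 200)), 4)  # 0.0000 0.3333 0.6667 1.0000
```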

nn_pred <- compute(nn_model, nn_test[,-9])
pred_results <- nn_pred$net.result
cor(pred_results, nn_test$Outcome)
## [,1]
## [1,] 0.4844594571
nnp_model <- neuralnet(Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
                         Insulin + BMI + DiabetesPedigreeFunction + Age,
                       data = nn_train, hidden = 10, stepmax = 1e6)
plot(nnp_model, rep = "best")

nnp_pred <- compute(nnp_model, nn_test[,-9])
predp_results <- nnp_pred$net.result
cor(predp_results, nn_test$Outcome)
## [,1]
## [1,] 0.392176431
nnph_model <- neuralnet(Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
                          Insulin + BMI + DiabetesPedigreeFunction + Age,
                        data = nn_train, hidden = c(10,10,10), stepmax = 1e6)
plot(nnph_model, rep = "best")

nnph_pred <- compute(nnph_model, nn_test[,-9])
predph_results <- nnph_pred$net.result
cor(predph_results, nn_test$Outcome)
## [,1]
## [1,] 0.2545543719
Using a Support Vector Machine Model (Radial, Linear & Laplacian) –
set.seed(1234)
svm_rbf_model <- ksvm(Outcome~., data = train, kernel = "rbfdot")
svm_rbf_model
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.110362872848112
##
## Number of Support Vectors : 130
##
## Objective Function Value : -90.3376
## Training error : 0.13617
svm_rbf_pred <- predict(svm_rbf_model, test)
cm_svm_rbf <- confusionMatrix(table(svm_rbf_pred, test$Outcome))
cm_svm_rbf
## Confusion Matrix and Statistics
##
##
## svm_rbf_pred Negative Positive
## Negative 53 19
## Positive 10 19
##
## Accuracy : 0.7128713
## 95% CI : (0.6143106, 0.798545)
## No Information Rate : 0.6237624
## P-Value [Acc > NIR] : 0.03857959
##
## Kappa : 0.3580977
## Mcnemar's Test P-Value : 0.13739483
##
## Sensitivity : 0.8412698
## Specificity : 0.5000000
## Pos Pred Value : 0.7361111
## Neg Pred Value : 0.6551724
## Prevalence : 0.6237624
## Detection Rate : 0.5247525
## Detection Prevalence : 0.7128713
## Balanced Accuracy : 0.6706349
##
## 'Positive' Class : Negative
##
set.seed(1234)
svm_linear_model <- ksvm(Outcome~., data = train, kernel = "vanilladot")
## Setting default kernel parameters
svm_linear_model
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 103
##
## Objective Function Value : -98.9642
## Training error : 0.187234
svm_linear_pred <- predict(svm_linear_model, test)
cm_svm_linear <- confusionMatrix(table(svm_linear_pred, test$Outcome))
cm_svm_linear
## Confusion Matrix and Statistics
##
##
## svm_linear_pred Negative Positive
## Negative 54 19
## Positive 9 19
##
## Accuracy : 0.7227723
## 95% CI : (0.6248177, 0.8072313)
## No Information Rate : 0.6237624
## P-Value [Acc > NIR] : 0.02376890
##
## Kappa : 0.376818
## Mcnemar's Test P-Value : 0.08897301
##
## Sensitivity : 0.8571429
## Specificity : 0.5000000
## Pos Pred Value : 0.7397260
## Neg Pred Value : 0.6785714
## Prevalence : 0.6237624
## Detection Rate : 0.5346535
## Detection Prevalence : 0.7227723
## Balanced Accuracy : 0.6785714
##
## 'Positive' Class : Negative
##
set.seed(1234)
svm_laplace_model <- ksvm(Outcome~., data = train, kernel = "laplacedot")
svm_laplace_model
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Laplace kernel function.
## Hyperparameter : sigma = 0.110362872848112
##
## Number of Support Vectors : 135
##
## Objective Function Value : -103.7358
## Training error : 0.140426
svm_laplace_pred <- predict(svm_laplace_model, test)
cm_svm_lapl <- confusionMatrix(table(svm_laplace_pred, test$Outcome))
cm_svm_lapl
## Confusion Matrix and Statistics
##
##
## svm_laplace_pred Negative Positive
## Negative 54 18
## Positive 9 20
##
## Accuracy : 0.7326733
## 95% CI : (0.6353758, 0.8158651)
## No Information Rate : 0.6237624
## P-Value [Acc > NIR] : 0.01401434
##
## Kappa : 0.4023669
## Mcnemar's Test P-Value : 0.12365771
##
## Sensitivity : 0.8571429
## Specificity : 0.5263158
## Pos Pred Value : 0.7500000
## Neg Pred Value : 0.6896552
## Prevalence : 0.6237624
## Detection Rate : 0.5346535
## Detection Prevalence : 0.7128713
## Balanced Accuracy : 0.6917293
##
## 'Positive' Class : Negative
##
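All three kernels above come from the kernlab package; for the Gaussian RBF model, the kernel is k(x, y) = exp(-sigma * ||x - y||^2), with sigma (about 0.11 in the fits above) typically chosen automatically via `sigest()`. A quick check that `rbfdot()` agrees with the formula (the points and sigma below are illustrative):

```r
library(kernlab)

rbf <- rbfdot(sigma = 0.11)   # same functional form as the fitted models
x <- c(1, 2); y <- c(2, 4)
k1 <- drop(rbf(x, y))                  # kernel evaluation
k2 <- exp(-0.11 * sum((x - y)^2))      # direct formula
all.equal(k1, k2)  # TRUE
```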
Comparing the Accuracy, Sensitivity and Specificity of the Models –
# Helper functions: pull accuracy, and the row-wise sensitivity/specificity
# conventions used throughout this paper, out of a caret confusion matrix
acc_of  <- function(cm) round(cm$overall[1], 6)
sens_of <- function(cm) round(cm$table[1] / (cm$table[1] + cm$table[3]), 6)
spec_of <- function(cm) round(cm$table[4] / (cm$table[2] + cm$table[4]), 6)
cms <- list(cm_c5_orig, cm_c5_boost, cm_rp_orig, cm_rp_tune,
            cm_oneR_orig, cm_jrip_orig, cm_nb_orig, cm_nb_lapl,
            cm_ctree_orig, cm_lda_orig, cm_rf_orig, cm_rf_tune)
tab_results <- data.frame(
  Predictive_Model = c("Original C5.0", "Tuned C5.0", "Original R-PART", "Tuned R-PART",
                       "One R Model", "JRip Model", "Original Naive Bayes",
                       "Laplacian Naive Bayes", "Classification Tree", "LDA Model",
                       "Original Random Forest", "Tuned Random Forest", "Logistic Regression",
                       "Gradient Boosting Model", "Gaussian SVM", "Laplacian SVM"),
  Accuracy = c(sapply(cms, acc_of), round(glm_acc, 6),
               acc_of(cm_gbm_orig), acc_of(cm_svm_rbf), acc_of(cm_svm_lapl)),
  Sensitivity = c(sapply(cms, sens_of), round(glm_sensi, 6),
                  sens_of(cm_gbm_orig), sens_of(cm_svm_rbf), sens_of(cm_svm_lapl)),
  Specificity = c(sapply(cms, spec_of), round(glm_speci, 6),
                  spec_of(cm_gbm_orig), spec_of(cm_svm_rbf), spec_of(cm_svm_lapl))
)
kable(tab_results, "html") %>%
kable_styling(bootstrap_options = "striped", font_size = 12) %>%
row_spec(c(2,8,10,11,12,13), bold = TRUE, color = "white", background = "green") %>%
row_spec(7, bold = TRUE, color = "white", background = "blue")
| Predictive_Model        | Accuracy | Sensitivity | Specificity |
|-------------------------|----------|-------------|-------------|
| Original C5.0           | 0.712871 | 0.723684    | 0.680000    |
| Tuned C5.0              | 0.772277 | 0.812500    | 0.702703    |
| Original R-PART         | 0.722772 | 0.733333    | 0.692308    |
| Tuned R-PART            | 0.742574 | 0.768116    | 0.687500    |
| One R Model             | 0.712871 | 0.736111    | 0.655172    |
| JRip Model              | 0.702970 | 0.761905    | 0.605263    |
| Original Naive Bayes    | 0.970297 | 0.983871    | 0.948718    |
| Laplacian Naive Bayes   | 0.792079 | 0.828125    | 0.729730    |
| Classification Tree     | 0.732673 | 0.790323    | 0.641026    |
| LDA Model               | 0.752475 | 0.750000    | 0.760000    |
| Original Random Forest  | 0.752475 | 0.756757    | 0.740741    |
| Tuned Random Forest     | 0.762376 | 0.767123    | 0.750000    |
| Logistic Regression     | 0.752475 | 0.750000    | 0.760000    |
| Gradient Boosting Model | 0.742574 | 0.753425    | 0.714286    |
| Gaussian SVM            | 0.712871 | 0.736111    | 0.655172    |
| Laplacian SVM           | 0.732673 | 0.750000    | 0.689655    |
col <- c("yellow", "darkblue")
par(mfrow=c(2,2))
fourfoldplot(cm_c5_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Original C5.0 (",round(cm_c5_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_c5_boost$table, color = col, conf.level = 0, margin = 1, main=paste("Tuned C5.0 (",round(cm_c5_boost$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_rp_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Original R-PART (",round(cm_rp_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_rp_tune$table, color = col, conf.level = 0, margin = 1, main=paste("Tuned R-PART (",round(cm_rp_tune$overall[1]*100),"%)",sep=""))

fourfoldplot(cm_oneR_orig$table, color = col, conf.level = 0, margin = 1, main=paste("One R Model (",round(cm_oneR_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_jrip_orig$table, color = col, conf.level = 0, margin = 1, main=paste("JRip Model (",round(cm_jrip_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_nb_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Original Naive Bayes (",round(cm_nb_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_nb_lapl$table, color = col, conf.level = 0, margin = 1, main=paste("Laplacian Naive Bayes (",round(cm_nb_lapl$overall[1]*100),"%)",sep=""))

fourfoldplot(cm_ctree_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Classification Tree (",round(cm_ctree_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_lda_orig$table, color = col, conf.level = 0, margin = 1, main=paste("LDA Model (",round(cm_lda_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_rf_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Original Random Forest (",round(cm_rf_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_rf_tune$table, color = col, conf.level = 0, margin = 1, main=paste("Tuned Random Forest (",round(cm_rf_tune$overall[1]*100),"%)",sep=""))

fourfoldplot(table(glm_pred>0.5, test$Outcome), color = col, conf.level = 0, margin = 1, main=paste("Logistic Regression (",round(glm_acc*100),"%)",sep=""))
fourfoldplot(cm_gbm_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Gradient Boosting Model (",round(cm_gbm_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_svm_rbf$table, color = col, conf.level = 0, margin = 1, main=paste("Gaussian SVM (",round(cm_svm_rbf$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_svm_lapl$table, color = col, conf.level = 0, margin = 1, main=paste("Laplacian SVM (",round(cm_svm_lapl$overall[1]*100),"%)",sep=""))

DISCUSSION OF RESULTS
## Note: the best-performing model is highlighted in blue among the top-performing models, which are highlighted in green.
##
## From the above results, we can conclude that the Naive Bayes model (without Laplace smoothing) performs the best, with a striking 97% accuracy compared to the other models. Surprisingly, Laplace smoothing with a parameter value of 50 reduces this model's accuracy to 79%. The second-best model in the table is the tuned C5.0 model, with an accuracy of 77%, and the third-best is the tuned random forest model, with an accuracy of 76%. The models with the highest accuracy are considered the best [13].
##
## In regard to sensitivity, the Naive Bayes model (without Laplace smoothing) again ranks highest at 98%, followed by the Laplacian Naive Bayes model at 83% and the tuned C5.0 model at 81%. The higher a model's sensitivity, the larger its true positive rate and the better its recall; such a model's Type-II error will likely be low [13].
##
## With respect to specificity, the Naive Bayes model (without Laplace smoothing) is again highest at 95%, followed by the LDA and logistic regression models tied at 76%, and the tuned random forest model at 75%. The higher a model's specificity, the larger its true negative rate; such a model's Type-I error will likely be low [13].
CONCLUSION
## From the above discussion, we can conclude that the Naive Bayes model without Laplace smoothing is best suited for this bioinformatics application, since it achieves the highest accuracy, sensitivity and specificity among all of the above models and, by implication, the lowest Type-I and Type-II error rates. Since the exploratory analysis identified the factors that contribute to diabetes, we can successfully build a predictive model that detects the onset of diabetes mellitus in patients from their physical attributes.
ACKNOWLEDGEMENTS
## I would like to sincerely thank Prof. Ivo D. Dinov for all his encouragement and support in enabling me to excel in this course. The homework assignments and in-class activities allowed me to understand much of the material, which ultimately helped me implement this whole project on my own. I would highly recommend this class on Data Science and Predictive Analytics (DSPA) to any student or professional interested in advancing their knowledge of using R to perform exploratory analyses and implement machine learning algorithms. Thank you very much for such good course material.
REFERENCES
## 1. 'About diabetes'. World Health Organization. Archived from the original on 31 March 2014. Retrieved 4 April 2014.
##
## 2. 'Diabetes Fact sheet N°312'. WHO. October 2013. Archived from the original on 26 August 2013. Retrieved 25 March 2014.
##
## 3. 'Update 2015'. IDF Diabetes Atlas. International Diabetes Federation. p. 13. Archived from the original on 22 March 2016. Retrieved 21 March 2016.
##
## 4. Williams textbook of endocrinology (12th ed.). Elsevier/Saunders. pp. 1371–1435. ISBN 978-1-4377-0324-5.
##
## 5. Shi Y, Hu FB (June 2014). 'The global implications of diabetes and cancer'. Lancet. 383 (9933): 1947–8. doi:10.1016/S0140-6736(14)60886-2. PMID 24910221.
##
## 6. Vos T, Flaxman AD, Naghavi M, Lozano R, Michaud C, Ezzati M, et al. (December 2012). 'Years lived with disability (YLDs) for 1160 sequelae of 289 diseases and injuries 1990-2010: a systematic analysis for the Global Burden of Disease Study 2010'. Lancet. 380 (9859): 2163–96. doi:10.1016/S0140-6736(12)61729-2. PMID 23245607.
##
## 7. IDF DIABETES ATLAS (6th ed.). International Diabetes Federation. 2013. p. 7. ISBN 2930229853. Archived from the original (PDF) on 9 June 2014.
##
## 8. 'Economic costs of diabetes in the U.S. in 2012'. Diabetes Care. 36 (4): 1033–46. April 2013. doi:10.2337/dc12-2625. PMC 3609540. PMID 23468086.
##
## 9. Ron Kohavi; Foster Provost (1998). 'Glossary of terms'. Machine Learning. 30: 271–274.
##
## 10. Pima Indians Diabetes - dataset by uci. (2017, August 16). Retrieved from https://data.world/uci/pima-indians-diabetes
##
## 11. Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
##
## 12. Dinov, I. (n.d.). Learning Modules. Retrieved from http://www.socr.umich.edu/people/dinov/courses/DSPA_Topics.html
##
## 13. Yang, C., Zou, Y., Liu, J., & Mulligan, K. (2015). Predictive model evaluation for PHM. International Journal of Prognostics and Health Management, 5(2), 1-11. Retrieved April 18, 2018, from https://nparc.nrc-cnrc.gc.ca/eng/view/object/?id=dce076fe-03db-4d8c-b097-5ca015aa414d.